Preface


Clinical epidemiology provides the scientific basis for the practice of medicine, because it 
focuses on the diagnosis, prognosis, and management of human disease. Therefore, issues 
of research design, measurement, and evaluation are critical to clinical epidemiology. This 
volume, Clinical Epidemiology: Practice and Methods, is intended to educate researchers on
how to undertake clinical research and should be helpful not only to medical practitioners
but also to basic scientists who want to extend their work to humans, to allied health
professionals interested in scientific evaluation, and to trainees in clinical epidemiology.

This book is divided into six parts. The first three introductory chapters focus on how 
to frame a clinical research question, the ethics associated with doing a research project in 
humans, and the definition of various biases that occur in clinical research. Parts II-IV 
examine issues of design, measurement, and analysis associated with various research 
designs, including determination of risk in longitudinal studies, assessment of therapy in 
randomized controlled clinical trials, and evaluation of diagnostic tests. Part V focuses on 
the more specialized area of clinical genetic research. Part VI provides the basic methods 
used in evidence-based decision making including critical appraisal, aggregation of multiple 
studies using meta-analysis, health technology assessment, clinical practice guidelines, 
development of health policy, translational research, how to utilize administrative
databases, and knowledge translation.

This collection provides advice on framing the research question and choosing the 
most appropriate research design, often the most difficult part in performing a research 
project that could change clinical practice. It discusses not only the basics of clinical
epidemiology but also the use of biomarkers and surrogates, patient-reported outcomes, and
qualitative research. It provides examples of bias in clinical studies, methods of sample size 
estimation, and an analytic framework for various research designs, including the scientific 
basis for multivariate modeling. Finally, practical chapters on research ethics, budgeting, 
funding, and managing clinical research projects may be useful. 

The content of this book can be divided into two categories: the basics of clinical
epidemiology and more advanced chapters examining the analysis of longitudinal studies
(Chapters 5-8) and randomized controlled trials (Chapters 13-15). Examples and case 
studies have been encouraged. 

All the contributors to this volume are practicing clinical epidemiologists, who hope 
the reader will join them in doing research focused on improving clinical outcomes. 

St. John’s, NL, Canada Patrick S. Parfrey

Brendan J. Barrett

Introduction



Chapter 1 


On Framing the Research Question and Choosing 
the Appropriate Research Design 

Patrick S. Parfrey and Pietro Ravani 

Abstract 

Clinical epidemiology is the science of human disease investigation with a focus on diagnosis, prognosis, and 
treatment. The generation of a reasonable question requires definition of patients, interventions, controls, 
and outcomes. The goal of research design is to minimize error, to ensure adequate samples, to measure 
input and output variables appropriately, to consider external and internal validities, to limit bias, and to 
address clinical as well as statistical relevance. The hierarchy of evidence for clinical decision-making places 
randomized controlled trials (RCT) or systematic review of good quality RCTs at the top of the evidence 
pyramid. Prognostic and etiologic questions are best addressed with longitudinal cohort studies. 

Key words Clinical epidemiology, Methodology, Research design, Evidence-based medicine, 
Randomized controlled trials, Longitudinal studies 


1 Introduction 


Clinical epidemiology is the science of human disease investigation,
with a focus on problems of most interest to patients: diagnosis,
prognosis, and management.

Articles are included in this book on the design and analysis of 
cohort studies and randomized controlled trials. Methodological 
issues involved with studies of biomarkers, quality of life, genetic 
diseases, and qualitative research are evaluated. In addition, chapters 
are presented on the methodology associated with aggregation of 
multiple studies such as meta-analysis, pharmacoeconomics, health 
technology assessment, and clinical practice guidelines. Finally, 
important issues involved in the practice of clinical epidemiology 
such as ethical approval and management of studies are discussed. In 
this chapter we consider how to frame the research question, error 
definition, measurement, sampling, the choice of research design, 
and the difference between clinical relevance and statistical significance.
In addition, here we provide an overview of principles and
concepts that are discussed in more detail in subsequent chapters.


2 Framing the Clinical Research Question

The research process usually starts with a general idea or initial 
problem. 

Research ideas may originate from practical clinical problems,
requests for proposals by funding agencies or private companies,
reading the literature and thinking of ways to extend or refine
previous research, or the translation of basic science discoveries to the
clinic or the community. Literature review is always required to
identify related research, to define the knowledge gap, to avoid
redundancy when the answer is already clear, and to set the research
within a proper conceptual and theoretical context based on what
is already known. The next step is to generate a researchable
question from the general idea. This stage of conceptualization should
generate testable hypotheses and delineate the exposure-outcome
relationship to be studied in a defined patient population, whether
it be a study of prognosis, diagnosis, or treatment. Thus
operationalization of the proposed study requires characterization of the
specific disease to be studied and establishment of the input variable or
exposure (test, risk factor, or intervention) to be associated with an
output or clinical outcome. The latter may be the gold standard in
a diagnostic test, the clinical event in a cohort study evaluating risk,
or the primary clinical outcome in a randomized trial of an
intervention. Thus, the broad initial idea is translated into a feasible
research project. Narrowing down the area of research is necessary
to formulate an answerable question, in which the target population
of the study is determined along with a meaningful effect
measure: the prespecified study outcome.

In framing researchable questions, it is crucial to define the
Patients, Interventions, Controls, and Outcomes (PICO) of relevance.
The study question should define the patients (P) to be
studied (e.g., prevalent or incident) through clearly defined eligibility
criteria. These criteria should specify the problem, the comorbid
conditions to include (because the answer(s) to the research
question may vary by the condition, e.g., diabetes, cardiovascular
disease), and those not to include (because for them the question
may be of less interest or hardly answerable, e.g., those with short
expected survival). Secondly, the type of exposure (intervention,
prognostic factor, or test; I) is defined, along with its specifics (e.g., what
the exposure actually comprises). Next, the comparison group
(C) is defined. Finally, the outcome of interest (O) is declared.
Following consideration of the PICO issues, the researchable question
can then be posed, e.g., “Does a particular statin prevent cardiac
events, when compared to conventional therapy, in diabetic
patients with stage 3 and 4 chronic kidney disease?”

The operationalization of the study must be consistent with its
purpose. If the question is one of efficacy (Does it work in the ideal
world?), then the measurement tools identified should be very
accurate, may be complex and expensive, and may not necessarily
be useful in practice. Opposite considerations are involved in
effectiveness studies (Does it work in the real world?), and trade-offs
between rigor and practicality are necessary. Further operational
steps in clinical research involve limiting error, whether it be
random or systematic error, identifying a representative sample to
study, determining a clinically relevant effect to assess, and ensuring
that the study is feasible, cost-effective, and ethical.


3 Error 

The goal of clinical research is to estimate population characteristics
(parameters) such as risks by making measurements on samples from
the target population. The hope is that the study estimates will be as
close as possible to the true values in the population (accuracy) with
little uncertainty (high precision) around them (Table 1). However, an
error component exists in any study. This is the difference between
the value observed in the sample and the true value of the
phenomenon of interest in the parent population.

Table 1
Precision and accuracy in clinical studies

| | Precision | Accuracy |
|---|---|---|
| Definition | Degree to which a variable has nearly the same value when measured several times | Degree to which a variable actually represents what it is supposed to represent |
| Synonyms | Fineness of a single measurement. Consistency: agreement of repeated measurements (reliability) or repeated sampling data (reproducibility) | Closeness of a measurement or estimate to the true value. Validity: agreement between measured and true values |
| Value to the study | Increase power to detect effects | Increase validity of conclusions |
| Threat | Random error (variance) | Systematic error (bias) |
| Maximization: sampling | Increase sample size | Randomization |
| Maximization: measurement | Variance reduction | Bias prevention/control |
| Observer sources | Procedure standardization, staff training | Blinding |
| Tool sources | Calibration; automatization | Appropriate instrument |
| Subject sources | Procedure standardization, repetition and averaging key measurements | Blinding |
| Assessment | Repeated measures (test/retest, inter/intra-observer: correlation, agreement, consistency) | Comparison with a reference standard (gold standard; formal experiments, RCT) |

RCT randomized controlled trial

Fig. 1 The effect of the error type on study results (panels grouped as “Inaccurate estimates” and
“Accurate estimates”). Each panel compares the distribution of a parameter
observed in a study (continuous lines) and the corresponding true distribution (dashed lines). Random error
lowers the precision of the estimates, increasing the dispersion of the observed values around the average
(studies #3 and #4). Systematic error (bias) causes incorrect estimates or “deviations from the truth”:
the estimated averages correspond to rings distant from the target center (studies #1 and #3) even if results
are precise (study #1). With permission Ravani et al., Nephrol Dial Transpl [19]

There are two main types of error: random or accidental error,
and systematic error (bias). Random errors are due to chance and
compensate since their average effect is zero. Systematic errors are
non-compensating distortions in measurement (Fig. 1). Mistakes
caused by carelessness or human fallibility (e.g., incorrect use of an
instrument, error in recording or in calculations) may contribute to
both random and systematic error. Both errors arise from many
sources and both can be minimized using different strategies.
However, their control can be costly and complete elimination may be
impossible. Systematic error, as opposed to random error, is not
reduced by increasing the study size and is replicated if the study is repeated.
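
The contrast between the two error types can be illustrated with a short simulation. This is only a sketch under stated assumptions: the blood pressure values, the bias of 5 mmHg, and the function names are hypothetical, and each measurement is assumed to equal the true value plus a fixed bias plus normally distributed noise.

```python
import random
import statistics


def simulate_study(true_value, n, sd, bias, rng):
    """Return the mean of n measurements: truth + fixed bias + random noise."""
    measurements = [true_value + bias + rng.gauss(0.0, sd) for _ in range(n)]
    return statistics.fmean(measurements)


def mean_absolute_error(true_value, n, sd, bias, repeats=200, seed=1):
    """Average distance between the study estimate and the truth over repeated studies."""
    rng = random.Random(seed)
    estimates = [simulate_study(true_value, n, sd, bias, rng) for _ in range(repeats)]
    return statistics.fmean([abs(e - true_value) for e in estimates])


if __name__ == "__main__":
    # Hypothetical values: true mean systolic pressure 120 mmHg, SD 15 mmHg.
    for n in (10, 1000):
        unbiased = mean_absolute_error(120.0, n, 15.0, bias=0.0)
        biased = mean_absolute_error(120.0, n, 15.0, bias=5.0)
        print(f"n={n}: unbiased error {unbiased:.2f}, biased error {biased:.2f}")
```

Under these assumptions the error of the unbiased studies shrinks as the sample grows, whereas the biased studies remain about 5 mmHg off target at n = 1000.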

Confounding is a special error since it is due to chance in
experimental designs but it is a bias in non-experimental studies.
Confounding occurs when the effect of the exposure is mixed
with that of another variable (confounder) related to both exposure
and outcome, which does not lie in the causal pathway
between them. For example, if high serum phosphate levels are
found to be associated with higher mortality, it is important to
consider the confounding effect of low glomerular filtration rate
in the assessment of the relationship between serum phosphate
and death [1].
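
The phosphate example can be made concrete with a small stratified calculation. The counts below are invented purely for illustration (they are not data from reference [1]), and the helper names are hypothetical. They are constructed so that phosphate has no effect on death within either stratum of glomerular filtration rate (GFR), yet appears harmful when the strata are pooled.

```python
def risk_ratio(exposed, unexposed):
    """Risk ratio from two (deaths, patients) pairs."""
    return (exposed[0] / exposed[1]) / (unexposed[0] / unexposed[1])


# Hypothetical (deaths, patients) counts by phosphate level within each GFR stratum.
strata = {
    "low GFR": {"high phosphate": (90, 300), "normal phosphate": (30, 100)},
    "normal GFR": {"high phosphate": (10, 100), "normal phosphate": (30, 300)},
}


def pooled(level):
    """Collapse the strata, ignoring GFR, for one phosphate level."""
    deaths = sum(stratum[level][0] for stratum in strata.values())
    patients = sum(stratum[level][1] for stratum in strata.values())
    return deaths, patients


crude_rr = risk_ratio(pooled("high phosphate"), pooled("normal phosphate"))
stratum_rr = {
    name: risk_ratio(stratum["high phosphate"], stratum["normal phosphate"])
    for name, stratum in strata.items()
}

if __name__ == "__main__":
    print(f"crude risk ratio: {crude_rr:.2f}")
    for name, rr in stratum_rr.items():
        print(f"{name}: risk ratio {rr:.2f}")
```

In this constructed example low GFR is related both to high phosphate and to death, so the crude comparison mixes the two effects; the stratum-specific ratios show no association.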

For any given study, the design should aim to limit error. 
In some cases, pilot studies are helpful in identifying the main 
potential sources of error (known sources of variability and bias— 
Table 1 ) such that the design of the main study can control them 
[2, 3]. Some errors are specific to some designs and are discussed in 
a subsequent chapter of this series. Both random and systematic 
errors can occur during all stages of a study, from conceptualization 
of the idea to sampling (participant selection) and actual measurements
(information collection).


4 Sampling 


Once the target population has been defined, the next challenge is
to recruit study participants representative of the target population.
The sampling process is important, as usually a small fraction
of the target population is studied for reasons of cost and feasibility.
Errors in the sampling process can affect both the actual estimate
and its precision (Table 1, Fig. 2). To reduce sampling errors
researchers must set up a proper sampling system and estimate an
adequate sample size.

Recruitment of a random sample of the target population is
necessary to ensure generalizability of study results. For example, if
we wish to estimate the prevalence of chronic kidney disease
(CKD) in the general population, the best approach would be to
use random sampling, possibly over-sampling some subgroup of
particular interest (e.g., members of a racial group) in order to
have sufficiently precise estimates for that subgroup [4, 5]. In this
instance, a sample of subjects drawn from a nephrology or a diabetic
clinic, any hospital department, school, or workplace, or people
walking down the street, would not be representative of the general
population. The likelihood of CKD may be positively or negatively
related to factors associated with receiving care or working in a
particular setting. On the other hand, if a study is aimed at
understanding the characteristics of patients with CKD referred to a
nephrologist, a study of consecutive patients referred for CKD
would probably provide a reasonably generalizable result.

Fig. 2 Sampling bias (panels: “Unbiased sample” and “Biased sample”). An unbiased sample is
representative of and has the same characteristics as the population
from which it has been drawn. A biased sample is not representative of the target population because its
characteristics have a different distribution as compared to the original population. With permission Ravani
et al., Nephrol Dial Transpl [19]

If the purpose of the study is to estimate a measure of effect
due to some intervention, then the sampling problem is not finished.
Here the comparability of study groups, other than with
regard to the exposure of interest, must be ensured. Indeed, to
measure the effect of a therapy, we need to contrast the experience
of people given the therapy with that of those not so treated. However,
people differ from one another in myriad ways, some of which might
affect the outcome of interest. To avoid such concerns in studies of
therapy, random assignment of study participants to therapy is
recommended to ensure comparability of study groups in the long
run. These groups must be of sufficient size to reduce the possibility that
some measurable or unmeasurable prognostic factors are associated
with one or other of the groups (random confounding).

The randomization process consists of three interrelated
maneuvers: the generation of random allocation sequences; strategies
to promote allocation concealment; and intention-to-treat
analysis. Random sequences are usually generated by means of computer
programs. The use of calendar or treatment days, birth dates,
etc. is not appropriate since it does not guarantee unpredictability.
Allocation concealment is meant to prevent those recruiting trial
subjects from knowing the upcoming assignment and to protect
against selection bias. Useful ways to implement concealed allocation
include the use of central randomization or of sequentially
numbered, sealed, opaque envelopes. Intention-to-treat analysis
consists of keeping all randomized patients in their originally assigned
groups during analysis regardless of adherence or any protocol
deviations. This is necessary to maintain group comparability.
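
As an illustration of computer-generated allocation sequences, the sketch below uses permuted blocks. Permuted-block randomization is one common scheme chosen here as an assumption; the chapter does not prescribe a particular method, and the function name, arm labels, and block size are hypothetical.

```python
import random


def block_randomization(n_participants, arms=("A", "B"), block_size=4, seed=None):
    """Generate an allocation sequence in shuffled blocks with equal arms per block."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * per_arm  # e.g. ["A", "B", "A", "B"]
        rng.shuffle(block)            # random order within the block
        sequence.extend(block)
    return sequence[:n_participants]


if __name__ == "__main__":
    # The seed is fixed here only to make the illustration reproducible.
    print(block_randomization(12, seed=2024))
```

Generating the sequence is only the first of the three maneuvers. In a real trial the list (and any seed) would be held away from recruiting staff, for example by a central service, so that upcoming assignments stay concealed.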


5 Sample Size Estimation 

When planning a comparative study two possible random errors
(called type I and type II errors) are considered. A type I error is made
if the results of a study are statistically significant when in
fact there is no difference between study groups. This risk of false
positive results is commonly set at 5 % (equivalent to a significance
threshold of P = 0.05). A type II error is made if the results of a study
are non-significant when in fact a difference truly exists. This risk
of false negative results is usually set at 10 or 20 %. The other factors
that determine how large a study should be are the size of the
effect to be detected and the expected outcome variability. Different
formulae exist to estimate the sample size depending on the type of
response variable and the analytical tool used to assess the input-output
relationship [6]. In all studies the sample size will depend
on the expected variability in the data, effect size (delta), level of
significance (alpha error), and study power (1-beta error).
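
To show how these four quantities combine, the sketch below applies the standard normal-approximation formula for comparing two means with equal group sizes, n per group = 2 (z for 1 - alpha/2 + z for power)^2 sd^2 / delta^2. It is one of the several formulae alluded to above, not necessarily the one used in reference [6]; the function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference delta (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for power = 1 - type II error
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical example: detect a 5 mmHg difference when the SD is 15 mmHg.
    print(n_per_group_means(delta=5.0, sd=15.0))              # 80 % power
    print(n_per_group_means(delta=5.0, sd=15.0, power=0.90))  # 90 % power
    print(n_per_group_means(delta=2.5, sd=15.0))              # smaller effect
```

Halving the effect to be detected roughly quadruples the required sample, which is why the choice of a clinically relevant delta dominates study size.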


6 Measurement 

6.1 Variable Types

As in all sciences, measurement is a central feature of clinical
epidemiology. Both input and output variables are measured on the
sample according to the chosen definitions. Inputs can be measured
once at baseline if their value is fixed (e.g., gender), or more
than once if their value can change during the study (such as blood
pressure or type of therapy). Outputs can also be measured once
(e.g., average blood pressure values after 6 months of treatment)
or multiple times (repeated measures of continuous variables such
as blood pressure or events such as hospital admissions). The
information gained from input and output variables depends on the
type of observed data: qualitative nominal (unordered
categories), qualitative ordinal (ordered categories), quantitative
interval (no meaningful zero), or quantitative ratio (zero is
meaningful).

In clinical epidemiology the type of outcomes influences study 
design and determines the analytical tool to be used to study the 
relationship of interest. 

Intermediate variables are often considered surrogate outcome
candidates and used as an outcome instead of the final end-point,
to reduce the sample size and the study cost (Table 2).
Candidate surrogate outcomes are many and include measures of
the underlying pathological process (e.g., vascular calcification),
or of preclinical disease (e.g., left ventricular hypertrophy).
However, well-validated surrogate variables highly predictive of
adverse clinical events, such as systolic blood pressure and LDL
cholesterol, are very few and only occasionally persuasive (Table 3).
Furthermore, although these surrogates may be useful in studies
of the general population, their relationship with clinical outcomes
is not linear in some conditions, making them less useful in
those settings [7, 8]. Hard outcomes that are clinically important
and easy to define are used to measure disease occurrence as well
as to estimate the effects of an exposure.

Table 2
Comparison between final outcome and intermediate (surrogate) response

| | Surrogate marker | Hard end-point |
|---|---|---|
| Definition | Relatively easily measured variables which predict a rare or distant outcome | The real efficacy measure of a clinical study |
| Use | May replace the clinical end-point; provide insight into the causal pathway | Response variable of a clinical study (outcome) |
| Example | Brain natriuretic peptide; left ventricular hypertrophy | Death (from all and specific causes); cardiovascular or other specified events |
| Advantages | (1) Reduction of study sample size, duration and cost; (2) Assessment of treatments in situations where the use of primary outcomes would be excessively invasive or premature | A change in the final outcome answers the essential questions on the clinical impact of treatment |
| Disadvantages | (1) A change in a valid surrogate end-point does not answer the essential questions on the clinical impact of treatment; (2) It may lack some of the desired characteristics a primary outcome should have | (1) Large sample size and long duration (cost) of the study; (2) Assessment of treatments may be premature or invasive |


Table 3
Validity issues for a surrogate end-point to be tested in an RCT

Surrogate marker validity: Is the plausible relationship between exposure (E) and the final hard
outcome (H) fully explained by the surrogate marker (S)?
  Yes: E → S → H
  No: E → S, but the effect of E on H does not pass through S

Desired characteristics of a surrogate:
  1. Validity/reliability
  2. Availability, affordability; suitable for monitoring
  3. Dose-response relation predictive of the hard end-point
  4. Existence of a cutoff point for normality
  5. High sensitivity, specificity, positive and negative predictive values
  6. Changes rapidly/accurately in response to treatment
  7. Levels normalize in states of remission

6.2 Measurement Errors

Some systematic and random errors may occur during measurement
(Table 1). Of interest to clinical trials are the strategies to
reduce performance bias (additional therapeutic interventions
preferentially provided to one of the groups) and to limit information
and detection bias (ascertainment or measurement bias) by
masking (blinding) [9]. Masking is a process whereby people are
kept unaware of which interventions have been used throughout
the study, including when the outcome is being assessed. Patient/
clinician blinding is not always practical or feasible, such as in trials
comparing surgery with non-surgery, diets, and lifestyles.

Finally, measurement error can occur in the statistical analysis 
of the data. Important elements to specify in the protocol include: 
definition of the primary and secondary outcome measure; how 
missing data will be handled (depending on the nature of the data 
there are different techniques); subgroup (secondary) analyses of 
interest; consideration of multiple comparisons and the inflation of 
the type I error rate as the number of tests increases; the potential 
confounders to control for; and the possible effect modifiers (interaction).
This issue has implications for modeling techniques and is
discussed in subsequent chapters.
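
The inflation of the type I error rate with multiple comparisons can be quantified with a short calculation. The sketch assumes the tests are independent, which is a simplification; the Bonferroni adjustment is shown as one common remedy and is not necessarily the approach taken in later chapters. The function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the overall rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        rate = familywise_error_rate(0.05, k)
        print(f"{k} tests: familywise error {rate:.3f}, "
              f"Bonferroni threshold {bonferroni_threshold(0.05, k):.4f}")
```

With ten independent tests at the 0.05 level, the chance of at least one spurious finding is about 40 %, which is why subgroup and secondary analyses should be prespecified in the protocol.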


7 External and Internal Validity 

The operational criteria applied in the design influence the external 
and internal validity of the study (Fig. 3). Both construct validity 
and external validity relate to generalization. However, construct 
validity involves generalizing from the study to the underlying 
concept of the study. It reflects how well the variables in the study 
(and their relationships) represent the phenomena of interest. For 
example, how well does the level of proteinuria represent the presence
of kidney disease? Construct validity becomes important
when a complex process, such as care for chronic kidney disease, is
being described. Maintaining consistency between the idea or concept
of a certain care program and the operational details of its
specific components in the study may be challenging. 

External validity involves generalizing conclusions from the 
study context to other people, places, or times. External validity is 
reduced if study eligibility criteria are strict, or the exposure or 
intervention is hard to reproduce in practice. The closer the 
intended sample is to the target population, the more relevant the 
study is to this wider, but defined, group of people, and the greater 
is its external validity. The same applies to the chosen intervention, 
control and outcome including the study context. The internal 
validity of a study depends primarily on the degree to which bias is 
minimized. Selection, measurement, and confounding biases can 
all affect the internal validity.

[Fig. 3 diagram. Ideal world (thinking): target population, PICO definition, study question; truth in the
universe. Empirical world (doing/observing): intended sample and intended PICO; actual sample and actual
measurements; truth in the study. Steps: planning, operationalization, design, sampling, implementation,
inference, generalization. Validity labels: construct, external, internal. Biases shown: selection,
indication, confounding; detection, information, performance; dilution, attrition, analytical.
Also labeled: randomization, masking.]

Fig. 3 Structure of study design. The left panel represents the design phase of a study, when Patient,
Intervention, Control and Outcome (PICO) are defined (conceptualization and operationalization). The right
panel corresponds to the implementation phase. Different types of bias can occur during sampling, data
collection and measurement. The extent to which the results in the study can be considered true and
generalizable depends on its internal and external validity. With permission Ravani et al., Nephrol Dial Transpl [19]


In any study there is always a balance between external and
internal validity, as it is difficult and costly to maximize both.
Designs that have strict inclusion and exclusion criteria tend to
maximize internal validity, while compromising external validity.
Internal validity is especially important in efficacy trials to understand
the maximum likely benefit that might be achieved with an
intervention, whereas external validity becomes more important
in effectiveness studies. Involvement of multiple sites is an
important way to enhance both internal validity (faster recruitment,
quality control, and standardized procedures for data
collection, management, and analysis) and external validity
(generalizability is enhanced because the study involves patients from
several regions).


8 Clinical Relevance vs. Statistical Significance

The concepts of clinical relevance and statistical significance are
often confused. Clinical relevance refers to the amount of benefit
or harm apparently resulting from an exposure or intervention that
is sufficient to change clinical practice or health policy. In planning
study sample size, the researcher has to determine the minimum
level of effect that would have clinical relevance. The level of statistical
significance chosen is the accepted probability of declaring an effect
when the observed results are due to chance alone. This corresponds
to the probability of
making a type I error, i.e., claiming an effect when in fact there is
none. By convention, this probability is usually 0.05 (but can be as
low as 0.01). The P value or the limits of the appropriate confidence
interval (a 95 % interval is equivalent to a significance level
of 0.05, for example) are examined to see if the results of the study
might be explained by chance. If P < 0.05, the null hypothesis of no
effect is rejected in favor of the study hypothesis, although it is still
possible that the observed results are simply due to chance.
However, since statistical significance depends on both the magnitude
of effect and the sample size, trials with very large sample sizes
theoretically can detect statistically significant but very small effects
that are of no clinical relevance.
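
The dependence of statistical significance on sample size can be illustrated with a simple two-sample z test. This is a sketch under stated assumptions: the standard deviation is treated as known, the groups are equal in size, and the 0.5 mmHg difference is a hypothetical effect assumed to be too small to matter clinically.

```python
import math
from statistics import NormalDist


def two_sample_p_value(mean_difference, sd, n_per_group):
    """Two-sided P value for a difference in means, known common SD, equal groups."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_difference / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Same hypothetical 0.5 mmHg difference (SD 15 mmHg) at two trial sizes.
    for n in (100, 50_000):
        print(f"n per group = {n}: P = {two_sample_p_value(0.5, 15.0, n):.2g}")
```

The observed difference is identical in both runs; only the trial size changes, yet the larger trial yields a very small P value. The P value therefore says nothing about whether 0.5 mmHg is worth acting on.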


9 Hierarchy of Evidence 

Fundamental to evidence-based health care is the concept of a “hierarchy
of evidence” deriving from different study designs addressing
a given research question (Fig. 4). Evidence grading is based on the
idea that different designs vary in their susceptibility to bias and,
therefore, in their ability to predict the true effectiveness of health
care practices. For assessment of interventions, randomized controlled
trials (RCTs) or systematic reviews of good quality RCTs are
at the top of the evidence pyramid, followed by longitudinal cohort,
case-control, and cross-sectional studies, with case series at the bottom
[10]. However, the choice of the study design depends on the question
at hand and the frequency of the disease. Intervention questions
ideally are addressed with experiments (RCTs) since observational
data are prone to unpredictable bias and confounding that only the
randomization process will control. Appropriately designed RCTs
also allow stronger causal inference for disease mechanisms.

Prognostic and etiologic questions are best addressed with longitudinal
cohort studies in which exposure is measured first and
participants are followed forward in time. At least two (and possibly
more) waves of measurements over time are undertaken. Initial
assessment of an input-output relationship may derive from
case-control studies where the direction of the study is reversed.
Participants are identified by the presence or absence of disease and
exposure is assessed retrospectively. Cross-sectional studies may be
appropriate for an initial evaluation of the accuracy of new diagnostic
tests as compared to a gold standard. Further assessments of
diagnostic programs are performed with longitudinal studies
(observational and experimental). Common biases afflicting observational
designs are defined in Chapter 3 and discussed in more
detail in Chapter 20.

Fig. 4 Examples of study designs. In cross-sectional studies inputs and output are measured simultaneously
and their relationship is assessed at a particular point in time. In case-control studies participants are
identified based on presence or absence of the disease and the temporal direction of the inquiry is reversed
(retrospective). Temporal sequences are better assessed in longitudinal cohort studies where exposure levels are
measured first and participants are followed forward in time. The same occurs in randomized controlled trials
(RCTs) where the assignment of the exposure is under the control of the researcher. With permission Ravani
et al., Nephrol Dial Transpl [20]. P Probability (or risk)


10 Experimental Designs for Intervention Questions 

The RCT design is appropriate for assessment of the clinical effects
of drugs, procedures, or care processes, definition of target levels
in risk factor modification (e.g., blood pressure, lipid levels, and
proteinuria), and assessment of the impact of screening programs.
Comparison to a placebo may be appropriate if no current standard
therapy exists. When accepted therapies exist (e.g., statins as
lipid-lowering agents, ACE-I for chronic kidney disease progression),
then the comparison group is an “active” control group that
receives usual or recommended therapy.

The most common type of RCT is the two-group parallel-arm
trial (Fig. 4). However, trials can compare any number of groups.
In factorial trials at least two active therapies (A, B) and potentially
their combination (AB) are compared to a control (C). Factorial
designs can be efficient since more therapies are simultaneously
tested in the same study. However, the efficiency and the appropriate
sample size are affected by the impact of multiple testing on
both type I and type II error, and by whether there is an interaction
between the effects of the therapies. In the absence of interaction,
the effect of A, for example, can be determined by comparing
A+AB to B+C. Interactions where use of A enhances the effectiveness
of B, for example, do not reduce the power of the study.
However, if there is antagonism between treatments, the sample
size can be inadequate.
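
As a rough illustration of the comparison of A+AB to B+C described above, the following Python sketch estimates the effect of A in a factorial trial by pooling the arms that received A and the arms that did not. The outcome values and function names are hypothetical, and the pooled estimate is meaningful only under the assumption of no interaction between A and B; the second function gives a simple contrast for checking that assumption.

```python
from statistics import mean

def marginal_effect_of_a(arm_a, arm_ab, arm_b, arm_c):
    """Difference in mean outcome between (A + AB) and (B + C).

    Interpretable as the effect of A only if A and B do not interact.
    """
    received_a = list(arm_a) + list(arm_ab)
    no_a = list(arm_b) + list(arm_c)
    return mean(received_a) - mean(no_a)

def interaction_contrast(arm_a, arm_ab, arm_b, arm_c):
    """Effect of A when B is given minus effect of A when B is not given."""
    return (mean(arm_ab) - mean(arm_b)) - (mean(arm_a) - mean(arm_c))

# Hypothetical change in systolic blood pressure (mmHg) in each arm.
arm_c = [0.0, -1.0, 1.0, 0.0]
arm_a = [-5.0, -6.0, -4.0, -5.0]
arm_b = [-3.0, -2.0, -4.0, -3.0]
arm_ab = [-8.0, -9.0, -7.0, -8.0]

effect_a = marginal_effect_of_a(arm_a, arm_ab, arm_b, arm_c)
interaction = interaction_contrast(arm_a, arm_ab, arm_b, arm_c)
```

In these made-up data the interaction contrast is zero, so the pooled comparison uses all participants to estimate the effect of A. A contrast far from zero would suggest that the pooled estimate should not be read as a single effect of A.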

The crossover design is an alternative solution when the outcome
is reversible [11]. In this design each participant serves as
their own control by receiving each treatment in a randomly
specified sequence. A washout period is used between treatments to
prevent carryover of the effect of the first treatment to the
subsequent periods. The design is efficient in that treatments are
compared within individuals, reducing the variation due to subject
differences. However, limitations include possible differential
carryover (one of the treatments tends to have a longer effect once
stopped); period effects (different response of disease to early
versus later therapy); and a greater impact of missing data because
they compromise within-subjects comparison and therefore variance
reduction.

Finally, RCTs may attempt to show that one treatment is not
inferior (under a one-sided hypothesis) or equivalent (under a
two-sided hypothesis) rather than superior to a comparable
intervention. In non-inferiority trials the null hypothesis of
inferiority is rejected if the effect of an intervention lies within a
prespecified non-inferiority margin. In equivalence trials the null
hypothesis of non-equivalence is rejected if the effect of an
intervention lies within two prespecified margins. These studies are
often done when new agents are being added to a class (e.g.,
another ACE inhibitor), or when a new therapy is already known
to be cheaper or safer than an existing standard. In such RCTs the
study size is estimated based on a prespecified maximum difference
that would still be considered irrelevant. For example, the claim
might be made that a new ACE inhibitor is non-inferior to
enalapril if the mean 24-h blood pressure difference between
them was no more than 3 mmHg. Non-inferiority trials have been
criticized because imperfections in study execution, which tend to
prevent detection of a difference between treatments, actually work in
favor of a conclusion of non-inferiority. Thus, in contrast to the
usual superiority trial, poorly done studies may lead to the desired
outcome for the study sponsor.
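
The decision rule in a non-inferiority trial can be sketched as follows. This is a minimal Python illustration, not a complete analysis plan: it assumes a continuous outcome for which higher values are worse, a normal approximation for the difference in means, and a one-sided alpha of 0.025. The function name and the summary data are hypothetical; only the 3 mmHg margin is taken from the example above.

```python
from math import sqrt
from statistics import NormalDist

def non_inferior(mean_new, mean_std, sd_new, sd_std, n_new, n_std,
                 margin, alpha=0.025):
    """Return (upper_bound, decision) for a one-sided non-inferiority test.

    Non-inferiority is concluded when the upper confidence bound of the
    difference (new minus standard) is below the prespecified margin.
    """
    diff = mean_new - mean_std
    se = sqrt(sd_new ** 2 / n_new + sd_std ** 2 / n_std)
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    upper = diff + z * se
    return upper, upper < margin

# Hypothetical summary data: mean 24-h blood pressure (mmHg).
upper_large, ok_large = non_inferior(131.0, 130.0, 10.0, 10.0,
                                     200, 200, margin=3.0)
upper_small, ok_small = non_inferior(131.0, 130.0, 10.0, 10.0,
                                     50, 50, margin=3.0)
```

With 200 participants per arm the upper bound falls just below 3 mmHg and non-inferiority would be claimed. With the same observed difference but only 50 per arm the bound exceeds the margin, which shows how the prespecified margin drives the required study size.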


11 Designs for Diagnostic Questions 

When assessing a diagnostic test the reference or “gold standard” 
tests for the suspected target disorders are often either inaccessible 
to clinicians or avoided for reasons of cost or risk. Therefore, the 
relationship between more easily measured phenomena (patient
history, physical and instrumental examination, and levels of
constituents of body fluids and tissues) and the final diagnosis is an
important subject of clinical research. Unfortunately, even the 
most promising diagnostic tests are never completely accurate. 

Clinical implications of test results should ideally be assessed in
four types of diagnostic studies. Table 4 shows examples from
diagnostic studies of troponins in coronary syndromes. As a first
step, one might compare test results among those known to have
established disease to results from those free of disease.
Cross-sectional studies can address this question (Fig. 4).


Table 4

Level of evidence in diagnostic studies using troponin as test (T) and acute myocardial infarction
(AMI) as target disorder (D)

Diagnostic question        Direction         Design           Problems                 Example                        Ref
-------------------------  ----------------  ---------------  -----------------------  -----------------------------  ----
Do D+ patients have        From D back to T  Cross-sectional  Reverse association;     Difference in troponin levels
different levels of T?                                        sampling bias            by AMI +/-

Are patients T+ more       From T to D       Cross-sectional  Effectiveness not        Troponin performance in        [12]
likely to be D+?                                              assessed; sampling bias  distinguishing AMI +/-

Does the level of T        From T to D       Longitudinal     Missing data;            Outcome study in subjects at   [12]
predict D +/-?                                                sampling bias            risk for AMI

Do tested patients have    From T to D       Experiment       Missing data             Outcome (randomized)           [14]
better final outcomes than                                                             comparison in subjects at risk
similar patients who do                                                                for AMI
not?

Positive (+); Negative (-). Missing data are possible in longitudinal or experimental designs: e.g., subjects lost before
assessment or with data not interpretable. Strategies should be set up to (1) minimize the likelihood of missing
information and (2) plan how subjects with missing information can be treated avoiding their exclusion (e.g., sensitivity
analysis, propensity analysis)


However, since the direction of interpretation in such a comparison is
from diagnosis back to the test, the results do not assess test
performance. To examine test performance requires data on whether
those with positive test results are
more likely to have the disease than those with normal results [12].
When the test variable is not binary (i.e., when it can assume more
than two values) it is possible to assess the trade-off between
sensitivity and specificity at different test result cutoff points [13]. In
these studies it is crucial to ensure independent blind assessment of
results of the test being assessed and the gold standard to which it
is compared, without the completion of either being contingent on
results of the other.
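
The trade-off between sensitivity and specificity at different cutoff points can be illustrated with a short Python sketch. The test values, the disease labels, and the rule that a result at or above the cutoff counts as positive are all assumptions made for illustration; they are not data from the studies cited.

```python
def sensitivity_specificity(values, diseased, cutoff):
    """Return (sensitivity, specificity), calling value >= cutoff positive."""
    pairs = list(zip(values, diseased))
    tp = sum(1 for v, d in pairs if d and v >= cutoff)
    fn = sum(1 for v, d in pairs if d and v < cutoff)
    tn = sum(1 for v, d in pairs if not d and v < cutoff)
    fp = sum(1 for v, d in pairs if not d and v >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical test values with gold-standard disease status.
values = [0.01, 0.02, 0.03, 0.05, 0.04, 0.08, 0.10, 0.20]
diseased = [False, False, False, False, True, True, True, True]

# Sensitivity and specificity at three candidate cutoff points.
tradeoff = {cutoff: sensitivity_specificity(values, diseased, cutoff)
            for cutoff in (0.03, 0.05, 0.10)}
```

In this toy data set, raising the cutoff lowers sensitivity and raises specificity. Plotting such pairs across all cutoffs gives the familiar receiver operating characteristic curve.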

Longitudinal studies are required to assess diagnostic tests 
aimed at predicting future prognosis or development of established 
disease [12]. The most stringent evaluation of a diagnostic test is 
to determine whether those tested have more rapid and accurate 
diagnosis, and as a result better health outcomes, than those not 
tested. The RCT design is the proper tool to answer this type of 
question [14]. 


12 Maximizing the Validity of Non-experimental Studies 

When randomization is not feasible, knowledge of the most
important sources of bias is essential to increasing the validity of
a study. Randomization may be infeasible for a variety of reasons:
study participants cannot be assigned to intervention groups by chance,
either for ethical reasons (e.g., in a study of smoking) or because of
participant willingness (e.g., comparing hemodialysis to peritoneal
dialysis); the exposure is fixed (e.g., gender); or the disease is rare and
participants cannot be enrolled in a timely manner. When strategies
are in place to prevent bias, results of non-experimental studies
may approach those of rigorous RCTs.


13 Reporting 


Adequate reporting is critical to the proper interpretation and 
evaluation of any study results. Guidelines for reporting primary 
(CONSORT, STROBE, and STARD for example) and secondary 
studies (PRISMA) are in place to help both investigators and 
consumers of clinical research [15-18]. Scientific reports may not 
fully reflect how the investigators conducted their studies, but 
the quality of the scientific report is a reasonable marker for how 
the overall project was conducted. The interested reader is 
referred to the above referenced citations for more details of what 
to look for in reports from prognostic, diagnostic, and intervention
studies.


Chapter 2


Research Ethics for Clinical Researchers 

John D. Harnett and Richard Neuman 

Abstract 

This chapter describes the history of the development of modern research ethics. The governance of 
research ethics is discussed and varies according to geographical location. However, the guidelines used for 
research ethics review are very similar across a wide variety of jurisdictions. The paramount importance of 
protecting the privacy and confidentiality of research participants is discussed at length. Particular emphasis 
is placed on the process of informed consent, and step-by-step practical guidelines are described. The issue 
of research in vulnerable populations is touched upon and guidelines are provided. Practical advice is 
provided for researchers to guide their interactions with research ethics boards. Issues related to scientific 
misconduct and research fraud are not dealt with in this paper. 

Key words Ethics, Informed consent, Privacy, Confidentiality, Inclusiveness, Protection of human 
research participants, Vulnerable populations, Risk-benefit assessment, Tri-Council Policy Statement
(TCPS) 


1 Research Ethics Development 

One of the earliest guides for the ethical conduct of research on
humans was provided by Virchow in the Berlin Code of 1900 [1].
The code outlined the requirement for informed consent, excluded
participation of minors and those incompetent, and allowed
research only under the direction of the institute’s medical
director. As with most codes for the ethical conduct of research, the
Berlin Code arose from the public outcry over unethical research.
In this case a “treatment” for syphilis, consisting of serum from
“recovering” syphilis patients, was administered to prostitutes
without their knowledge or consent, resulting in the spread of
syphilis among the prostitutes and their clients [1].

The Nuremberg Code (1949; [2]) was conceived after World War II
by the prosecution as part of the case against physicians who had
conducted “research” under the Nazi regime in Germany [3]. The Code
describes the “legal, ethical and moral” basis on which research
could be conducted in humans and served as the basis by which to
decide whether research conducted by the defendants met an
acceptable public standard. The Nuremberg Code became part of the
verdict of the war crimes trial and was later signed by the 51 charter
members of the United Nations [3]. An expanded document, the
Helsinki Declaration, derived from the Nuremberg Code but outlining
in detail the conduct of acceptable biomedical research, was approved
by the World Medical Association in 1964 [4].

It should be understood that although the principles and conduct
of human research had been to some extent codified, there
remained no generalized requirement for mandatory ethical review
for research on humans. No systematic process was in place to
ensure independent and impartial review to judge whether a
research project was ethically sound. It was simply left to the
investigator to see that the research was designed and executed in an
acceptable manner. However, landmark publications by Pappworth
(1965; [5]) and Beecher (1966; [6]) documented numerous studies
in the UK and the USA that failed to meet such standards.
Standards of consent and protection of vulnerable populations were
repeatedly violated in the most egregious manner. Beecher felt that
ethical conduct should not be decided by a board or panel, but
instead was the responsibility of the investigator. However, public
awareness and outrage over the Tuskegee study led to an outcry for
action that went beyond the investigator [7]. The Tuskegee study,
which started in 1932 and continued until its termination in 1972,
employed deception, enticement, and unwarranted medical
invasiveness while following the natural course of syphilis in 400 African
American males who were consciously denied access to medical
treatment. Despite concerns raised within the US Public Health
Service, review by the Center for Disease Control in 1969 allowed
the study to continue. The unethical conduct in human research
documented by Beecher and revealed in the Tuskegee study led to
passage of the National Research Act in 1974 [8], which
institutionalized mandatory ethical review for all biomedical and behavioral
research on humans and set the stage for the Belmont Report [8].

Unlike the previous ethical codes, the Belmont Report (1979;
[8]) established a set of ethical principles underpinning the
regulatory framework for research on humans. The principles are Respect
for Persons, Beneficence, and Justice. Respect for Persons
recognizes that humans are autonomous agents and as such must give
informed consent to participate in research. Moreover, their
privacy must be respected and whatever data is collected from their
participation must be held in a confidential manner. Members of
vulnerable populations, e.g., children and the institutionalized,
require additional measures to ensure their protection. Humans
must not be considered the means to an end, i.e., the generation of
research results. Beneficence obligates the investigator to design
research so as to maximize the benefits and minimize harms. For
each study the risks and the benefits must be evaluated and risks
must be justified by potential benefits. Justice requires the fair
treatment of participants. Those who are likely to share in the
potential benefits of the research should equally share in the risks.
Vulnerability should not be exploited to provide a pool of research
participants, nor should vulnerability exclude a group that might
benefit from the research. It is important to appreciate that a
particular research design may bring these principles into conflict and
that no principle trumps another in the ethical review process.


2 Governance 


Ensuring an unbiased evaluation of ethical acceptability requires a
governance structure that minimizes real and perceived conflicts of
interest by the investigator, the institution, and members of the
ethics review committee. Human participants are the means by
which research results are generated and as such may be exploited
by: (1) the investigator interested in achieving financial gain or
career advancement; (2) the institution, which may gain status,
overhead funding, or a share in patents and other intellectual
property arising from the research enterprise; and (3) members of the
review body, who may have a personal or financial interest in the
research outcome. Unfortunately, examples of such exploitation at
the level of the investigator, institution, and review committee are
readily available.

Canadian Research Ethics Boards (Institutional Review Boards,
US; Research Ethics Committees, UK) require members with
scientific expertise commensurate with the research under review
(unscientific research is by definition unethical), expertise in
bioethics and relevant law, and representation by the community, the
group that is the beneficiary of research in the widest sense. Review
committees must follow nationally or internationally accepted
regulations or guidelines for the conduct of human research, e.g., Good
Clinical Practice [9], Tri-Council Policy Statement (TCPS; Canada;
[10]), and Common Rule (US; [11]). As well, institutions must
have policies in place to assure independence of the ethical review
process. The ethics review body must have written policies and
standard operating procedures that outline the detailed operations
of the review process and its supporting infrastructure, and that
ensure procedures are in place for research oversight and continuing
ethical review.


3 Privacy and Confidentiality 

Research participants have a right to expect that their privacy
will be protected and that data collected will be maintained in a
confidential manner by the investigator and study personnel.
Clearly maximum protection is provided when data is collected
anonymously, e.g., a survey is completed without disclosing
personal information or demographic data that would allow
identification of the participant. Collecting identifiable data should be
justified in relation to the expected benefits of the study. Once
collected the identifier should be coded and only the coded
identifier should be stored with the data. The file containing identifiers
should be stored in a password-protected file or locked file cabinet
in a locked room. Access to identifiers should be strictly limited to
study personnel on a need-to-know basis. Retention of identifiable
data should be limited in time consistent with institutional policies
on research integrity. Long-term retention of such data requires
justification. Assurance regarding protection of privacy and
confidentiality should be outlined in the consent form or as part of the
consent process. Moreover, when confidentiality cannot be
maintained, as in a focus group setting, or when privacy is clearly
compromised in those cases where facial photographs are published to
describe a genetic or medical condition, this must be emphasized
in the consent form.
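
The coding of identifiers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the "P" prefix, and the function name are hypothetical, and in practice the linking key would be written to a separate, access-restricted location in line with institutional policy.

```python
import secrets

def code_identifiers(records, id_field="health_number"):
    """Replace a direct identifier with a random study code.

    Returns (coded_records, linking_key). The linking key maps each
    identifier to its study code and is meant to be stored separately
    from the study data, with access limited to those who need it.
    """
    linking_key = {}
    used_codes = set()
    coded_records = []
    for record in records:
        identifier = record[id_field]
        if identifier not in linking_key:
            code = "P" + secrets.token_hex(4)
            while code in used_codes:  # regenerate on the rare collision
                code = "P" + secrets.token_hex(4)
            used_codes.add(code)
            linking_key[identifier] = code
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["study_code"] = linking_key[identifier]
        coded_records.append(coded)
    return coded_records, linking_key

# Hypothetical records; the field names are illustrative only.
records = [
    {"health_number": "A123", "visit": 1, "systolic_bp": 142},
    {"health_number": "B456", "visit": 1, "systolic_bp": 128},
    {"health_number": "A123", "visit": 2, "systolic_bp": 136},
]
coded_records, linking_key = code_identifiers(records)
```

The coded records carry only the study code, so the analysis data set can be shared within the study team while the linking key stays under restricted access.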

Public concerns with issues of privacy and confidentiality have
resulted in extensive legislation guiding the use and dissemination
of personal information and in particular the use of personal health
information [12]. Investigators and ethics review committees must
be aware of this legislation and how it may impact research.
Moreover, in many jurisdictions ethics review committees have the
authority to grant approval for the use of personal health
information in the absence of informed consent when such use can be
justified by the nature of the research, or when obtaining consent
is not feasible or would pose additional risks.


4 Composition of a Research Ethics Board 

GCP [9] outlines guidelines for the minimum required membership
for a research ethics board. It states that the IRB/IEC
should consist of a reasonable number of members, who collectively
have the qualifications and experience to review and evaluate
the science, medical aspects, and ethics of the proposed trial. It is
recommended that the IRB/IEC should include:

1. At least five members. 

2. At least one member whose primary area of interest is in a 
nonscientific area. 

3. At least one member who is independent of the institution/trial
site.


5 How a Research Ethics Board Functions

As well as being aware of the composition of the REB, applicants 
should also have some appreciation as to how their local REB 
functions. Applications are screened by the cochairs and those
considered to be of minimal risk are triaged for expedited review by
one board member and cochair. If approval is recommended, this 
is brought to the full REB for ratification only. No further review 
occurs. If expedited review identifies important ethical issues, then 
the proposal goes to the full board for review. 

In situations in which more than minimal risk is involved, the 
application goes to the full board for review. One member is 
assigned the task of detailed review and presentation. The primary 
reviewer receives the detailed protocol, if available. All members of 
the board read each application and all applications are discussed at 
the board meetings which in our institution occur every 2 weeks. 
Decisions are generally arrived at by consensus, although a vote is 
taken for the record. Questions are communicated to the researcher. 
In cases where resolution of the issues proves difficult, the researcher 
may be invited to present in person to the board. This is generally 
not required and the majority of applications are approved in a 
timely fashion. If a proposal is not approved, the researcher has 
the right of appeal to a duly constituted independent appeals 
committee. 


6 Balancing Risks and Benefits 

One of the most important tasks of a research ethics board is
deciding if the benefits of a proposed research project outweigh
potential risks. In situations where more than minimal risk is
involved, more intense scrutiny of the research is required,
including a scholarly review of the proposed research. In Canada the
TCPS defines minimal risk as follows: “If potential subjects can reasonably
be expected to regard the probability and magnitude of possible
harms implied by participation in the research to be no greater
than those encountered by the subject in those aspects of his or her
everyday life that relate to the research then the research can be
regarded as within the range of minimal risk.”

Scholarly review is generally done in the setting of peer review. 
This poses significant logistical problems for research ethics boards. 
A true peer review process is time-consuming and could impede 
timely review and approval of research proposals. There are several 
approaches to this issue. In large institutions a separate peer review 
process may be in place. This does delay the timeframe of ethical 
review. Sometimes funding for the proposal is already secured and 
comments from a granting agency peer review panel may be available.
More commonly the research ethics board is sufficiently expert and
diverse to provide a reasonable assessment of the scientific validity 
of a research proposal. This review is critically dependent on the 
quality and clarity of the submission provided by the researcher. 
Comments on scientific validity are often perceived by researchers 
as beyond the purview of a research ethics board. However, in 
situations where more than minimal risk is involved, a research 
ethics board has the obligation to assess the scientific validity of the 
proposed research. 

A final decision on the risk-benefit ratio of a research proposal 
involves a review of the quality of the proposal, the likely side 
effects of the proposed intervention and the potential benefits to 
participants. Ultimately it is a judgment call of an appropriately 
constituted research ethics board. In situations where doubt arises, 
a formal presentation by the researcher to the ethics board may be 
helpful. 


7 Informed Consent 


Free and informed consent is a cornerstone of ethical research
involving human subjects. It begins with the initial contact and
must be sustained until the end of the involvement of the subjects
in the research project. Free and informed consent is an iterative
process whereby research subjects are informed in understandable
terms about the details of the proposed research. While each
organization is likely to have its own informed consent template, a
template developed in Newfoundland and Labrador in Canada
[13] provides a practical guide to developing an informed consent
document for a clinical trial and addresses the key questions.
This type of approach could easily be modified for other
types of research designs.

What is a clinical trial? 

This section should address how a clinical trial differs from
normal clinical care. It should address the concept of
randomization and the possible applicability of the results of the clinical trial
to others with a clinical condition similar to the subjects.

Do I have to take part in this clinical trial? 

This section needs to stress the voluntary nature of participation
in a research project and an assurance of normal clinical care
should the subject decide not to participate.

Will this trial help me? 

For randomization to be ethical the response here has to be 
one indicating that benefit is uncertain. 

Why is this trial being done? 

This section should provide, in lay terms, the rationale for the 
research question. 

What is being tested?

Has the intervention been approved by the appropriate regulatory
authorities, or is this trial a step towards that approval process? Has
the intervention been tested in animals and what, in lay terms, was
found? Has the intervention been tested in humans? How many
were studied and what was shown?

Why am I being asked to take part? 

This should include a statement as to how a particular individual
was flagged for possible inclusion in the study. This should
provide an assurance to the potential participant that their
autonomy and privacy have been protected during this process.

Who can take part in this trial? 

This should clearly list the inclusion and exclusion criteria in 
understandable terms and must mirror those criteria outlined in 
the more detailed protocol. 

How long will I be in the trial? 

The research participant must be made aware of the overall 
duration of participation in the study. The amount of time involved 
in participating in trial activities must be explicitly stated. 

How many people will take part in this trial? 

Describe whether this is a single-center or multicenter study. 
If the latter is the case, indicate the number of local and overall 
participants. 

How is this trial being done? 

This section should provide a detailed but understandable
description of the research methodology. This should include
details of randomization and blinding as well as a detailed
description of what the experimental and control arms entail. Details
regarding proposed blood and tissue collection should be described.
Clearly describe anything involved in the trial which is not part of
standard clinical care.

What about birth control and pregnancy? 

Most organizations have standard wording addressing these
issues. This should include what is known of the risks of the
intervention and what birth control measures (for both the research
subject and any sexual partners) are necessary for inclusion in the
study. There will often be uncertainty about possible teratogenic
effects or effects on breastfeeding babies. In the absence of
information it should be assumed that the possibility of such effects
exists and individuals should be advised appropriately.

Are there risks to the trial? 

Possible adverse effects of the intervention should be listed
and grouped according to frequency. Risks of any other procedures
being performed as a result of participation in the study must be
outlined (e.g., additional radiation exposure as a consequence of
imaging procedures that are part of the study and would not be
done if normal clinical care applied). Occasionally certain
questions on questionnaires may be distressing or uncomfortable for
participants. Subjects should be given the option not to answer
such questions. If certain questions have a high likelihood of
causing distress, appropriate support services, such as counseling,
need to be in place and should be referred to in this section.

Are there other choices? 

It must be clearly stated that the subject does not have to
participate in the trial and a description of what other treatments are
available should be provided. It should also be stated that once
enrolled in the trial simultaneous enrollment in another clinical
trial is not permissible.

What are my responsibilities? 

It is important to point out that research participants should 
comply with the research protocol, report any changes in health 
status and provide updated information on the use of other 
medications. 

Can I be taken out of the trial without my consent? 

Research participants can be removed from a trial at the
discretion of the investigator if he/she feels they are not complying with
instructions or if their continued participation is harmful because
of side effects or deterioration in their health status. The
participant must be informed of the reason for withdrawal from the trial
in the event that this happens.

What about new information? 

If new information becomes available that may affect the 
participant’s health status or willingness to continue in the study, 
this must be discussed with the participant. 

Will it cost me anything? 

Information regarding costs to the participant of being in the
trial must be discussed. Reimbursement for expenses may be
available and the participant must be made aware of this. If payment of
participants is planned, this must be outlined. Payments that
constitute an inducement to participate or to accept exposure to
excessive risk are not allowed by research ethics boards. Provisions for payment
for treatment of or compensation for research-related injuries must
be addressed in this section of the consent form. If information
from the study results in a patented product of commercial value, the
participant will not usually receive any financial benefit. This should
be made clear to participants.

What about my right to privacy? 

Research participants should be assured of privacy and
confidentiality. Outside agencies may be privy to private and
confidential information for the purposes of audit or licensing. They are
expected to observe strict confidentiality when examining the data.
The participant must be informed of who will have access to their
data. The duration of data storage must be specified. In the case of
clinical trials this is generally 25 years after completion of data
collection. Details of how the data will be stored and what steps will
be taken to ensure secure and confidential storage must be
provided. Information on how confidentiality will be assured for any
blood or tissue collected must be specifically addressed and will
vary depending on the study objectives and the nature of the blood
and tissue being stored.

What if I want to quit the study? 

Explain the procedure for withdrawal to the participant. Not 
uncommonly data collected up to the point of withdrawal will be 
retained and may be used in data analysis to ensure validity of the 
study. If this is the case, this must be disclosed to the research 
participant. If the participant has already agreed to blood or tissue 
storage for future use, he/she must be given the option to withdraw 
or affirm such an agreement at this point. 

What will happen to my sample after the trial is over? 

If the sample is to be destroyed, this should be specified. If it is
to be used for future research, the sample may be coded to allow
future linkage or it may be anonymized, in which case future
linkage to the participant will not be possible. The participant must be
informed and provide consent for either option. If genetic material
is to be used for future research, the participant must be informed
of any possibility of re-contact and must consent to it. The
participant may also wish to specify the types of future research
that he/she would consent to (e.g., an individual might consent to
future use of their DNA for a specific disease and not necessarily
for unrestricted use for any research purpose). Studies involving
future use of research samples are normally considered sub-studies
and require a separate consent form to be signed addressing the
issues outlined here.

Declaration of financial interest

If the investigator has a financial interest in conducting the trial,
this should be declared. If no financial interest is involved, this
should be stated.

What about questions or problems? 

If the participant has questions about the trial or has a medical
concern, he/she should be provided with contact information for
the local principal investigator and study coordinator. If a medical
concern arises outside normal work hours, details of the process
in place to contact help should be provided.

If the participant has questions about their rights or concerns 
with the way in which the study is being conducted, appropriate 
contact information should be provided. The contact in this case 
will vary in different jurisdictions. It will often be through the 
office of the research ethics board. 

The signature page 

The signature page should include a statement that the participant
has had an ample opportunity to discuss the proposed
research, that they understand the proposed research and have had
their questions answered satisfactorily. It should indicate that they
have been informed of who may access their research records and
should indicate that they have the right to withdraw at any time
subject to the conditions outlined in the consent form.
The form must be signed by the participant, an independent
witness, the principal investigator and the individual who has performed
the consent discussion (if not the principal investigator).
The signature of the next of kin/legal guardian must be provided
for certain types of research (e.g., research involving unemancipated
minors and incompetent adults). If the consent form requires
translation into another language, the signature of the translator is
also required.


8 Inclusiveness in Research 

Historically, certain groups of individuals have been underrepresented
in and sometimes deliberately excluded from research. Such
groups have included women, the elderly, children and incompetent
adults. This list is not exhaustive and the reasons for exclusion
of such groups are complex and varied and beyond the scope of 
this chapter. This issue has been specifically addressed by the 
Canadian TCPS in Section 5 [13] as follows: 

Where research is designed to survey a number of living research subjects
because of their involvement in generic activities (e.g., in many
areas of health research, or in some social science research such as studies
of child poverty or of access to legal clinics) that are not specific to
particular identifiable groups, researchers shall not exclude prospective
or actual research subjects on the basis of such attributes as culture,
religion, race, mental or physical disability, sexual orientation, ethnicity,
sex or age, unless there is a valid reason for doing so.

This statement is based on the principle of distributive justice.
Its premise is that it is unethical to exclude individuals from participation
in potentially beneficial research. Obviously the protection
of these individuals from harm by inclusion in research is equally
important. Indeed, because some of these groups include potentially
vulnerable populations, protecting them from harm and obtaining
fully informed consent present some unique challenges.
For the purposes of this chapter we will confine discussion to two
vulnerable groups commonly involved in research: children and
incompetent adults.

Often in incompetent adults the incompetence is caused by the 
disease which requires study. In this case the research cannot be 
done in a less vulnerable population and the intervention under 
study may directly benefit participants and others with the same 
disease. Consent is usually obtained from a proxy in this case, usually 
the next of kin or legal guardian, who is expected to act in the best 
interest of the individual participant. When studying incompetent 
adults it is important to recognize and establish that there are many 
types of specific competencies. While the individual being studied
may not be competent to understand all of the intricacies of the
study and the informed consent process, they may be perfectly
competent to refuse a painful procedure (e.g., needle stick) as part
of the study.

Research in adults may not be generalizable to children for a 
variety of biological, developmental and psychosocial reasons. 
Quite often the disease being studied is more prevalent in or 
exclusive to children. Again proxy consent is required for minors 
with the exception of emancipated minors. However, children 
beyond a certain age are capable of understanding many of the 
issues involved and should be involved in the informed consent 
process and asked to give assent to any proposed research. In certain 
cases during the course of a study children may reach the age 
where legal consent is possible. If not incompetent for other reasons,
they should then be asked to sign the informed consent
document on their own behalf. 


9 Practical Tips for Researchers Applying to Research Ethics Boards 

• Familiarize yourself with the research ethical guidelines that 
are used in your jurisdiction. 

• Satisfy yourself that the research question is important and the 
research design is sound. 

• Do not cut and paste from the protocol into the ethics application.
Summarize the protocol so that it can be easily read by all
members of the research ethics board. Remember that in most
jurisdictions one member of the board is assigned to read the
entire protocol and summarize it for the other members.

• Identify upfront what you think the ethical issues may be and 
present these in your application. 

• If you have a particular concern, get some advice prior to 
submission from an appropriate member of the ethics board. 

• Ensure that all sections on the form are complete and that the 
submission is signed. 

• If the research requires a consent form, spend time on preparing
it at a readable level. Most boards will index this against a
certain educational level. Computer programs are available to 
assess readability level. 

• Remember the primary function of the research ethics board is 
to protect human subjects involved in research. Boards have an 
ethical obligation to facilitate sound ethical research while fulfilling
this function. Interpret any comments or questions from
the board with these two concepts in mind. 
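
The tip above on readability can be checked with a few lines of code. The sketch below is only an illustration: it uses the Flesch-Kincaid grade level formula with a crude vowel-group syllable counter, and it assumes the board indexes readability against a school grade level. The function names and the two sample sentences are hypothetical, and a validated readability tool should be preferred for an actual submission.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent "e" (but not "-le") usually adds no syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate school grade level needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


plain = "You may stop at any time. We will keep your data safe."
jargon = ("Participation is voluntary and discontinuation is permissible "
          "without jeopardizing subsequent medical management.")

print(f"plain wording:  grade {flesch_kincaid_grade(plain):.1f}")
print(f"jargon wording: grade {flesch_kincaid_grade(jargon):.1f}")
```

Short sentences and short words lower the score, so the plain wording should come out several grade levels below the jargon version.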
Chapter 3


Definitions of Bias in Clinical Research 

Geoffrey Warden 

Abstract 

In this chapter a catalog of the various types of bias that can affect the validity of clinical epidemiologic
studies is presented. The biases can be grouped into those associated with selection of subjects,
misclassification or misinformation, and finally confounding. Definitions are provided for each type of bias listed.

Key words Bias, Definition, Selection, Misclassification, Misinformation, Confounding 


1 Introduction 


There are three main causes of inaccuracy in clinical epidemiologic
research: random variation, confounding, and bias. Of the three
causes, bias is the most easily controlled by the investigator and,
when left uncontrolled, creates systematic errors that distort the
measure of a study’s true effect. Miettinen first classified biases into
three broad categories: those that occur during the selection and
grouping of study participants, misclassification or misinformation,
and confounding [1]. Later, Sackett catalogued 57 biases
that can arise in analytical research and listed the stages of research
at which each was likely to occur [2]. Numerous forms of bias have
been added to these original lists, making it challenging for investigators
to recognize and describe the types of bias they are committing
or witnessing. Here, 150 types of bias occurring in clinical
and observational epidemiological research have been collected,
categorized, and defined with the goal of providing investigators a
quick and accessible reference.

First the general terms involving bias have been defined.
Secondly, each type of bias has been categorized by the stage of
research at which it is most likely to occur, listed alphabetically
within each stage, and defined. Table 1 provides reference by stage
of biases the investigator may be looking for. Examples have been
given when it was thought that definitions did not provide an
adequate depiction. Further examples of bias in genetic disease
research are provided in Chapter 20.

Table 1

Classification of epidemiological biases by stage of research

General terms

    Bias
    Confounding bias
    Differential bias
    Design bias
    Information bias/misclassification bias
    Non-differential bias
    Random error
    Selection bias
    Systematic error

Literature review and publications

    All’s well literature bias
    Foreign language exclusion bias
    Hot stuff bias
    Literature search bias
    One-sided reference bias
    Positive results bias
    Rhetoric bias

Designing the study and selecting the study sample

    Admission rate/Berkson’s bias/referral bias
    Allocation sequence bias
    Autopsy series bias
    Centripetal bias
    Channeling bias
    Consent bias
    Diagnostic access bias
    Diagnostic purity bias
    Diagnostic vogue bias
    Ecological fallacy/aggregation bias
    Exclusion bias
    Healthy worker bias/healthy worker effect
    Immigrant bias
    Inclusion control bias
    Lead-time bias
    Length bias
    Loss to follow-up bias/attrition bias
    Membership bias
    Migration bias
    Mimicry bias
    Misclassification bias
    Missing clinical data bias
    Non-contemporaneous control bias
    Non-respondent bias
    Non-simultaneous comparison bias
    Overdiagnosis bias
    Popularity bias
    Prevalence-incidence bias/Neyman bias
    Previous opinion bias
    Procedure selection bias
    Record linkage bias
    Referral filter bias
    Response bias
    Sample size bias
    Sampling bias
    Stage bias
    Self-selection bias
    Spectrum bias/case mixed bias
    Starting time bias
    Susceptibility bias
    Survivor treatment selection bias
    Solicitation sampling bias
    Unacceptable disease bias/faking good bias
    Unmasking bias/detection signal bias
    Volunteer bias
    Will Rogers phenomenon/Will Rogers bias
    Withdrawal bias/dropout bias

Executing the intervention

    Bogus control bias
    Co-intervention bias
    Compliance bias
    Contamination bias
    Performance bias/procedure bias
    Proficiency bias

Measuring exposures and outcomes

    Apprehension bias
    Attention bias/Hawthorne effect/observer effect
    Case definition bias
    Competing death bias
    Confirmation bias
    Context bias
    Culture bias
    Data capture error/data capture bias
    Data entry bias
    Data merging error
    Detection bias/surveillance bias
    Diagnostic suspicion bias
    Diagnostic review bias
    End-aversion/central tendency bias
    End-digit preference bias
    Expectation bias
    Exposure suspicion bias
    Faking bad bias
    Family history/family information bias
    Forced choice bias
    Framing bias
    Hospital discharge bias
    Incorporation bias
    Indication bias
    Insensitive measure bias
    Instrument bias
    Interobserver variability bias
    Interview setting bias
    Interviewer bias
    Intraobserver variability bias
    Juxtaposed scale bias
    Laboratory data bias
    Latency bias
    Obsequiousness bias
    Observer bias
    Protopathic bias
    Positive satisfaction bias
    Proxy respondent bias
    Questionnaire bias
    Recall bias
    Reporting bias
    Response fatigue bias
    Review bias
    Scale format bias
    Sensitive question/social desirability bias
    Spatial bias
    Substitution game bias
    Test review bias
    Therapeutic personality bias
    Unacceptability bias
    Underlying cause bias/rumination bias
    Yes-saying bias

Data analysis

    Anchoring bias/adjustment bias
    Data dredging bias
    Distribution assumption bias
    Enquiry unit bias
    Estimator bias
    Missing data bias/data completeness bias
    Multiple exposure bias
    Non-random sampling bias
    Omitted-variable bias
    Outlier handling/tidying-up bias
    Overmatching bias
    Post-hoc significance bias
    Repeated peeks bias
    Regression to mean bias
    Scale degradation bias
    Standard population bias
    Treatment analysis bias
    Verification bias/workup bias
    Verification (differential) bias
    Verification (partial) bias

Interpretation and publication

    Auxiliary hypothesis bias
    Assumption bias
    Cognitive dissonance bias
    Correlation bias
    Generalization bias
    Interpretation bias
    Magnitude bias
    Mistaken identity bias
    Significance bias
    Under-exhaustion bias

To create this list of biases, a comprehensive literature search 
of numerous relevant medical databases (PubMed, EMBASE, 
The Cochrane Library, Biological Abstracts, Biomedical Reference 
Collection, CINAHL, Clinical Evidence) was performed using 
various combinations of the key words “bias,” “clinical,” “epidemiological,”
“medical,” “error,” “systematic error,” “research,”
“outcome measurement,” “study design,” “confounding,” and 
“methods.” Sackett’s original list was used to reference biases of 
analytical research [2]. Rodriguez’s review was also used to 
encompass other forms of bias [3]. Additionally, Porta’s 
“Dictionary of Epidemiology” and Gail’s “Encyclopedia of 
Epidemiologic Methods” were used to shape some definitions [4,5]. 
Reference lists of relevant papers were also surveyed to identify 
additional literature on individual biases. This list of biases may not
be complete, but it is an attempt to comprehensively define the
majority of biases an investigator will encounter in epidemiologic
research.


2 General Terms 


Bias: When a systematic error of study design distorts the true
effect of the variable under study.

Confounding bias: When a separate variable, which is not an
intermediate step between exposure and outcome, is disproportionately
distributed between exposure and control groups and causes a
distortion of the true effect of the variable under study. A confounder
must be (1) associated with the exposure in the target population,
(2) a causal risk factor for the outcome in the unexposed cohort,
and (3) not an intermediate cause between exposure and
outcome.
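
A small worked example may help show how a confounder distorts a crude estimate. The sketch below uses entirely hypothetical counts in which the exposure has no effect within either level of the confounder, yet the confounder is far more common among the exposed and raises the risk of the outcome. The crude risk ratio is compared with a Mantel-Haenszel risk ratio, one standard way to adjust across strata; the function and variable names are illustrative.

```python
def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Risk in the exposed divided by risk in the unexposed."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


# Hypothetical counts per stratum of the confounder:
# (cases_exposed, n_exposed, cases_unexposed, n_unexposed)
strata = {
    "confounder present": (400, 800, 100, 200),  # risk 0.5 in both groups
    "confounder absent": (20, 200, 80, 800),     # risk 0.1 in both groups
}


def crude_rr(strata):
    """Risk ratio after collapsing the strata into a single table."""
    a, n1, c, n0 = (sum(col) for col in zip(*strata.values()))
    return risk_ratio(a, n1, c, n0)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio across strata."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
    return num / den


for name, (a, n1, c, n0) in strata.items():
    print(f"{name}: RR = {risk_ratio(a, n1, c, n0):.2f}")
print(f"crude RR:            {crude_rr(strata):.2f}")
print(f"Mantel-Haenszel RR:  {mantel_haenszel_rr(strata):.2f}")
```

Within each stratum the risk ratio is 1, but the collapsed table suggests more than a doubling of risk; the stratified estimate removes the distortion because it compares like with like.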

Differential bias: When a bias unequally affects comparison groups
so that both the final measurements or outcomes and the comparison
between groups are biased.

Design bias: When the methodological architecture of a study creates
a positive or negative distortion of the study effect.

Information bias/misclassification bias: When a systematic error in
the measurement of exposures, outcomes, or covariates results in
information quality discrepancies between comparison groups and
distorts the true effect of the variable under study.

Non-differential bias: When a bias affects both comparison groups
in an equal manner so that although the final measure is a biased
result, the comparison between groups remains unbiased.

Random error: When a study error is due to chance variability and
fluctuation of observation and measurement.

Selection bias: When the methods used to select the sample population
result in a favoring of one comparison group over another, or
neglect a significant proportion of the target population.

Systematic error: When study error is inherent in a system’s design
and causes repeated inaccuracies of observation and measurement.
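
The difference between random and systematic error can be illustrated with a short simulation. This is a sketch under stated assumptions: the true value, the noise level, and the fixed offset of a miscalibrated instrument are hypothetical numbers chosen only for the example.

```python
import random
from statistics import mean

TRUE_VALUE = 120.0  # hypothetical true systolic blood pressure (mmHg)
NOISE_SD = 10.0     # hypothetical chance variability of a single reading
OFFSET = 5.0        # hypothetical miscalibration of the instrument


def mean_reading(n_readings: int, offset: float, seed: int = 0) -> float:
    """Mean of n noisy readings, each shifted by a fixed offset."""
    rng = random.Random(seed)
    return mean(TRUE_VALUE + offset + rng.gauss(0.0, NOISE_SD)
                for _ in range(n_readings))


random_only = mean_reading(10_000, offset=0.0)       # random error only
with_systematic = mean_reading(10_000, offset=OFFSET)  # plus systematic error

print(f"true value:             {TRUE_VALUE:.1f}")
print(f"random error only:      {random_only:.1f}")
print(f"with systematic error:  {with_systematic:.1f}")
```

Averaging many readings shrinks the random error toward zero, while the systematic offset survives any number of repetitions.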


3 Biases Associated with Literature Review and Publications 

All’s well literature bias: When scientific groups publish reports
that omit controversial or opposing results.

Foreign language exclusion bias: When foreign language research is
less often published or recognized.

Hot stuff bias: When scientific literature is published based on the
popularity and excitement around a topic and not the credibility or
quality of research.

Literature search bias: When a publication or statement is derived
from an incomplete search and review of a topic’s literature.

One-sided reference bias: When an investigator or author restricts
referenced information to only those publications that support
their position.

Positive results bias: When positive results on a topic are published
more frequently than negative results and bias the overall
consensus.

Rhetoric bias: When persuasive speech or writing is used to sway a
reader’s interpretation of data to coincide with the author’s
interpretation or position.


4 Biases Associated with Study Design and Subject Selection 

Admission rate/Berkson’s bias/referral bias: When the relationship
between exposure and disease is distorted by the selection of study
participants from hospitals with inflated admission rates due to the
study of the disease at said hospital. Patients in hospitals studying
the disease will have a higher exposure rate when compared to controls,
which may inaccurately define the relationship between the
exposure and disease during analysis.

Allocation sequence bias: When the intervention allocation sequence
is not concealed from investigators and introduces an intentional or
unintentional selection bias.

Autopsy series bias: When disease rates and inferences are derived
from a non-random autopsy series sample that does not represent
the true population.

Centripetal bias: When incidence and prevalence rates are inflated
by the draw of a clinician’s or institution’s reputation for working
with a specific disorder.

Channeling bias: When investigators channel patients who are
more likely to have a treatment response to the intervention group
rather than the control group.

Consent bias: When a study’s sample population does not represent
its target population due to systematic differences between
consenting and non-consenting patients.

Diagnostic access bias: When the accessibility of diagnostic testing
results in the overestimation or underestimation of disease prevalence
or incidence rates.

Diagnostic purity bias: When the target population is no longer
represented due to the selection of an unrealistic “pure” diagnostic
group without the clinical comorbidities seen in the target
population.

Diagnostic vogue bias: When a disease receives different diagnostic
labels at different points in space or time, resulting in the
misclassification or exclusion of study participants from the target
population.

Ecological fallacy/aggregation bias: When an effect or relationship
on an aggregate level is applied to an individual level where it no
longer holds true.

Exclusion bias: When eligibility criteria for participant inclusion
into a study are applied differently to cases and controls.

Healthy worker bias/healthy worker effect: When the sample population
derived from healthy volunteering participants no longer
represents the target population. Those who are able to actively
participate in a study have been shown to be generally healthier and
more compliant with study design than those unable or unwilling to
participate.

Immigrant bias: When immigrant populations experience different
health outcomes than the native population. Immigrants have
sometimes been shown to experience better health outcomes. This
phenomenon may arise because only those healthy enough to work
immigrate to a new population. It may also be due to immigrants
returning home when they fall ill and thus shifting the morbidity.
Another theory is that underreporting of disease and an inaccessible
health care system cause the phenomenon. Regardless of the
cause, it is important to ensure that immigrant populations are
well represented and equally distributed in study groups.

Inclusion control bias: When the inclusion criteria for controls are
associated with the exposure, and a higher than expected rate of
disease or outcome in the reference group creates a bias towards
the null.

Lead-time bias: When participant survival time is overestimated
due to earlier diagnosis of disease. Screening can appear to increase
survival by providing an earlier diagnosis when compared to no
screening. However, earlier diagnosis only creates a greater time
interval between diagnosis and death. Without an effective intervention
the age of death is often unchanged and only the time
interval of known disease diagnosis has increased.
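
The arithmetic behind lead-time bias can be shown with a toy cohort. In this minimal sketch every age is hypothetical, and screening is assumed to move diagnosis three years earlier without changing the age at death.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age_symptoms: float  # age at which symptoms would prompt diagnosis
    age_death: float     # age at death, assumed unaffected by screening


def five_year_survival(patients, lead_time=0.0):
    """Share of patients alive 5 years after diagnosis.

    lead_time is the number of years by which screening moves the
    diagnosis earlier than the symptomatic diagnosis.
    """
    alive = sum(
        1 for p in patients
        if p.age_death - (p.age_symptoms - lead_time) >= 5
    )
    return alive / len(patients)


cohort = [Patient(60, 62), Patient(65, 68), Patient(58, 64), Patient(70, 73)]

without_screening = five_year_survival(cohort)
with_screening = five_year_survival(cohort, lead_time=3)

print(f"5-year survival without screening: {without_screening:.0%}")
print(f"5-year survival with screening:    {with_screening:.0%}")
```

Five-year survival rises from one in four to four in four even though nobody lives a day longer; only the starting point of the clock has moved.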

Length bias: When rapidly progressing cases of disease are missed
by screening and therefore disproportionately represented when
compared to slowly progressing cases of disease. Disease detection
is directly proportional to the length of time a disease is in a detectable
stage; therefore slowly progressing cases are more easily identified
by screening.
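
Because detection is proportional to time spent in the detectable stage, the mix of screen-detected cases can be worked out directly. The sketch below assumes a hypothetical population in which fast and slow cases are equally common but remain detectable for one and four years respectively; the names and numbers are illustrative.

```python
def screen_detected_share(population_share, detectable_years):
    """Expected share of each case type among screen-detected cases.

    Each type is weighted by how common it is and by how long it stays
    in the detectable stage.
    """
    weights = {
        kind: population_share[kind] * detectable_years[kind]
        for kind in population_share
    }
    total = sum(weights.values())
    return {kind: w / total for kind, w in weights.items()}


population_share = {"fast": 0.5, "slow": 0.5}  # hypothetical case mix
detectable_years = {"fast": 1.0, "slow": 4.0}  # hypothetical durations

detected = screen_detected_share(population_share, detectable_years)
for kind, share in detected.items():
    print(f"{kind}-progressing cases among screen-detected: {share:.0%}")
```

Slow cases make up half of all disease but four fifths of what screening finds, so a screen-detected group looks healthier than the disease population it came from.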

Loss to follow-up bias/attrition bias: When systematic differences
between comparison groups, such as neglectful observation in
control groups, cause an uneven loss or withdrawal of participants
from one group over another. Those who are lost to follow-up
may be more or less likely to develop the outcome of interest
and the true relationship between intervention and outcome may
be distorted.

Membership bias: When membership in a particular group corresponds
with a level of health that is different from others in the
target population.

Migration bias: When disease rates are misrepresented due to the
migration patterns of those with disease. For example, those with
severe seasonal allergies may be more inclined to move away from
a rural environment and seek refuge in a less pollen-rich
urban area. This migration could create a surprisingly high
prevalence rate of seasonal allergies in an urban setting.

Mimicry bias: When an exposure causes a benign disorder resembling
the disease and subsequently becomes suspected of causing the
disease. Participants may be misclassified in their disease status,
causing investigators to find inaccurate associations between exposure
and disease.
Misclassification bias: (Type I) Non-differential misclassification
is when all classes, groups, or categories of a variable have the same
rate of being misclassified for all study participants. (Type II)
Differential misclassification is when the rate of being misclassified
differs between groups of the study.
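
The two types of misclassification can be compared with expected counts from a hypothetical case-control study. The sketch below assumes a binary exposure measured with a given sensitivity and specificity; all counts and error rates are invented for illustration.

```python
def observed_counts(exposed, unexposed, sensitivity, specificity):
    """Expected observed (exposed, unexposed) counts after measurement error."""
    obs_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    obs_unexposed = (exposed + unexposed) - obs_exposed
    return obs_exposed, obs_unexposed


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical true counts: (exposed, unexposed)
cases, controls = (200, 300), (100, 400)
or_true = odds_ratio(*cases, *controls)

# Non-differential: same error rates for cases and controls.
or_nondiff = odds_ratio(
    *observed_counts(*cases, sensitivity=0.8, specificity=0.9),
    *observed_counts(*controls, sensitivity=0.8, specificity=0.9),
)

# Differential: cases recall their exposure better than controls do.
or_diff = odds_ratio(
    *observed_counts(*cases, sensitivity=0.95, specificity=0.9),
    *observed_counts(*controls, sensitivity=0.7, specificity=0.9),
)

print(f"true OR:                  {or_true:.2f}")
print(f"non-differential error:   {or_nondiff:.2f}")
print(f"differential error:       {or_diff:.2f}")
```

In this example the non-differential error pulls the odds ratio toward 1, while the differential error pushes it away from the true value; with other error rates, differential misclassification can bias in either direction.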

Missing clinical data bias: When the misclassification of study
participants is due to missing or unrecorded clinical data.

Non-contemporaneous control bias: When controls are sampled at a
different time period and are no longer comparable to the current
sample population due to changes in definitions, exposures,
diagnoses, diseases and treatments.

Non-respondent bias: When non-respondents systematically differ
from respondents and the sample population is no longer representative
of the target population.

Non-simultaneous comparison bias: When exposure/intervention
groups are compared to controls or reference standards in a different
time and space. Different variables surrounding the time or
space when groups are examined may influence the outcomes and
cause poor generalizability.

Overdiagnosis bias: When pseudo or subclinical disease, which would
not have become apparent before the patient dies of other causes,
is diagnosed based on investigator exploration.

Popularity bias: When an interest in a particular disease or therapy
causes preferential exposure of patients to observation or
procedures.

Prevalence-incidence bias/Neyman bias: When study participants
are incorrectly put into the unexposed group due to short or silent
evidence of exposure prior to disease.

Previous opinion bias: When a prior diagnostic result affects the
subsequent diagnostic process of a patient.

Procedure selection bias: When patients with poor risks are preferentially
offered clinical procedures.

Record linkage bias: When those who are not within a linked
database are not represented in the sample population.

Referral filter bias: When an inflated prevalence of rare disease is
due to patient referral to a center with a higher level of care.

Response bias: When a study’s response rate and participant uptake
are not large enough to accurately represent the target population.

Sample size bias: When studies are designed with sample sizes that
are too small to ensure that results are not a product of random
variability.
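
One way to guard against sample size bias is to estimate the required sample size before the study begins. The sketch below uses the standard normal-approximation formula for comparing two proportions with equal group sizes; the event rates, significance level, and power are hypothetical planning values, and a statistician should confirm the calculation for a real protocol.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1: float, p2: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group to compare two proportions.

    Two-sided test, normal approximation, equal group sizes.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical planning values: 30% event rate in controls,
# 20% with the intervention.
print(n_per_group(0.30, 0.20))
# Halving the expected difference sharply increases the requirement.
print(n_per_group(0.30, 0.25))
```

The required size grows with the inverse square of the difference to be detected, which is why studies aimed at modest effects need far more participants than intuition suggests.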
Sampling bias: When all members of the target population do not
have the same probability of inclusion in the sample population of
the study.

Stage bias: When methods for determining the stage of disease that
vary across geographical areas and time are spuriously used to
compare the survival rates of different patient populations.

Self-selection bias: When a proportion of the target population is
missed due to the sample population being created from participants
deciding whether or not they want to participate in the study.

Spectrum bias/case mixed bias: When the accuracy of a diagnostic
test is assessed in a population that differs from the target or
clinically relevant population.

Starting time bias: When the inability to use a common starting
time for exposure or disease causes the misclassification of exposure
or disease status in study participants.

Susceptibility bias: When an exposure causes two separate diseases
that precede one another and the treatment of the first disease
falsely appears to cause the second disease. Susceptibility bias is a
form of confounding.

Survivor treatment selection bias: When an ineffective treatment
appears to prolong survival; however, the benefit is due to patients
who live longer having more time to select another effective treatment,
while those who die earlier are untreated by default.

Solicitation sampling (telephone/e-mail/door-to-door) bias: When a
sample population is recruited by a method that is not accessible to
a proportion of the target population.

Unacceptable disease or exposure bias/faking good bias: When
socially unacceptable diseases or exposures are underreported by
participants and investigators and cause misclassification of study
participants.

Unmasking bias/detection signal bias: When an otherwise innocent
exposure causes a sign or symptom of a disease and precipitates a
search for said disease. A well-known example is that menopausal
women administered estrogen for hormone replacement therapy
experience some associated bleeding, which leads to investigation
and potential detection of endometrial cancer. When a group of
those receiving estrogen is compared to women not receiving estrogen,
the diagnosis of subclinical endometrial cancer in the non-estrogen
group is delayed and one may infer that the estrogen
caused endometrial cancer or led to a more aggressive form of it.

Volunteer bias: When a sample population does not represent the
target population due to volunteer participants exhibiting properties
that systematically differ from non-volunteers. This bias is
often prevalent in studies that exhibit high participant refusal rates.
Studies with refusal rates >20% should show that the volunteering
participants do not differ greatly from those who refused.

Will Rogers phenomenon/Will Rogers bias: When moving a patient
from one group to another group raises the average values of both
groups. The phenomenon is named after the social commentator Will
Rogers, who joked, “When the Okies left Oklahoma and moved to
California, they raised the average intelligence in both states.” This
type of bias occurs when diagnostic ability increases, or disease
criteria widen, and previously disease negative patients are classified
as disease positive. Previously subclinical patients were sicker
than average healthy individuals; however, they have a better
prognosis than the previous disease positive population. Therefore
the new inclusion of early stage disease in the disease positive
group creates a better overall prognosis for both the new positive
and negative disease status groups. This phenomenon is most often
seen when new diagnostic tools and criteria for cancer staging move
patients from one stage to another.
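
A few lines of arithmetic show how both group averages can rise at once. The survival times below are hypothetical; the patient who is reclassified has the worst prognosis in the early-stage group but a better prognosis than anyone in the late-stage group.

```python
from statistics import mean

# Hypothetical survival times in months under the old staging criteria.
early_stage = [60, 55, 50, 30]
late_stage = [20, 10, 5]

# Improved staging reclassifies the 30-month patient as late stage.
migrant = 30
early_after = [m for m in early_stage if m != migrant]
late_after = late_stage + [migrant]

print(f"early stage mean: {mean(early_stage):.1f} -> {mean(early_after):.1f}")
print(f"late stage mean:  {mean(late_stage):.1f} -> {mean(late_after):.1f}")
print(f"overall mean:     {mean(early_stage + late_stage):.1f} -> "
      f"{mean(early_after + late_after):.1f}")
```

Both stage-specific averages improve although no patient's outcome has changed, which the unchanged overall mean confirms.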

Withdrawal bias/dropout bias: When participants who have withdrawn
from a study significantly differ from those who remain.
This may cause two different problems with a study. Firstly, an
equal withdrawal rate between comparison groups may mean the
target population is no longer represented and create clinically
irrelevant results; secondly, an unequal withdrawal rate between
groups may create an inaccurate comparison from the loss of clinically
relevant outcomes in the withdrawals.


5 Biases Associated with Executing the Intervention 

Bogus control bias: When the intervention group falsely appears
superior due to the reallocation of participants with negative outcomes
to the control group.

Co-intervention bias: When the study effect is distorted by comparison
groups unequally receiving an additional and unaccounted
intervention.

Compliance bias: When non-significant results are due to poor participant
adherence to the interventional regime, rather than inadequacy
of the intervention.

Contamination bias: When members of the control group inadvertently
receive the experimental intervention. For example, a participant
in the placebo-control group might receive the new
medication under study from investigators or another study
participant.
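
The effect of contamination on an observed treatment effect can be sketched with simple expected values. The example assumes hypothetical outcome risks and that contaminated controls receive the full benefit of the intervention, which is a simplification.

```python
def observed_risk_difference(risk_control: float, risk_treated: float,
                             contamination: float) -> float:
    """Expected control-minus-treated risk difference when a fraction of
    controls inadvertently receives the intervention."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    observed_control = ((1 - contamination) * risk_control
                        + contamination * risk_treated)
    return observed_control - risk_treated


# Hypothetical risks: 30% of untreated and 20% of treated participants
# experience the outcome.
true_effect = observed_risk_difference(0.30, 0.20, contamination=0.0)
diluted_effect = observed_risk_difference(0.30, 0.20, contamination=0.25)

print(f"risk difference without contamination: {true_effect:.3f}")
print(f"risk difference with 25% contamination: {diluted_effect:.3f}")
```

The observed difference shrinks in proportion to the contaminated fraction, so an effective intervention can appear weaker than it is.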
Performance bias/procedure bias: When factors within the experimental
regime, other than the intervention under study, are systematically
employed differently between comparison groups. The
maintenance of blinding helps ensure that both groups receive an
equal amount of attention, treatment and investigations.

Proficiency bias: When the intervention under study is unequally
applied to groups due to differences in resources or the administration
of the intervention. This type of bias may occur when there
are a number of people administering the intervention at multiple
sites. To reduce proficiency bias, it is essential to create clear and
efficient methodology for intervention administration.


6 Biases Associated with Measuring Exposures and Outcomes 

Apprehension bias: When participant apprehension causes measurements
to be altered from their usual levels. A classic example of this
phenomenon is known as “white coat syndrome.” This common
syndrome refers to the elevation of patient blood pressure due to the
apprehension and nervousness patients feel from a physician’s presence.
Often a patient’s blood pressure will be greatly reduced if
they record it themselves at home.

Attention bias/Hawthorne effect/observer effect: When participants
alter their natural response or behavior due to investigator observation.
Usually people are more likely to give a favorable response
or perform better due to their awareness of their involvement in a
study and attention received during the study.

Case definition bias: When uncertainty of the exact case definition
leads to subjective interpretation by investigators.

Competing death bias: When an exposure is falsely credited with
the outcome of death over the causative and competing exposure.
This is most commonly seen during investigations of disease outcomes
in geriatric populations, as the increased age and comorbidities
create a high competing risk of death.

Confirmation bias: When investigators favor data, outcomes, or
results that reaffirm their own hypotheses.

Context bias: When investigators diagnose a participant with disease
based on prior knowledge of disease in the population under
study, despite test results that are only marginal or suggestive of
disease. Examples of context bias often occur in radiology. A radiologist
working in Africa is more likely to diagnose tuberculosis,
whereas an oncology center radiologist is more likely to identify a
cancer, from the same abnormality on chest x-ray.

Culture bias: When populations derived from separate cultures experience
different outcomes under the same intervention or exposure.

Data capture error/data capture bias : When a recurrent systematic
error in the recording of data influences the outcomes and results.
Data capture error can result in the collected data being equally
misrepresentative between groups and create clinically irrelevant
results, while data capture bias implies that the data capture errors
favor or disfavor the outcomes for one particular group
more than the other.

Data entry bias : When the process of converting raw data into a 
database results in a favorable outcome for either comparison group. 

Data merging bias : When the process of merging data results in a 
favorable outcome for either comparison group. 

Detection bias/surveillance bias : When exposure, or knowledge of 
exposure, influences the diagnosis of disease or detection of an 
outcome. 

Diagnostic suspicion bias : When the intensity or outcome of the 
diagnostic process is influenced by a prior knowledge of participant 
exposure status. 

Diagnostic review bias : When the interpretation of the reference 
diagnostic standard is made with knowledge of the results of 
the diagnostic test under study. 

End-aversion/end of scale/central tendency bias : When questionnaire
respondents avoid the ends of answer scales in surveys. Most 
respondents will report more conservatively and answer closer to 
the middle of a scale. 

End-digit preference bias : When observers record terminal digits 
with increased frequency while converting analog to digital data. 
A common example is that investigators, nurses, and clinicians 
prefer to record blood pressures, temperatures, respiratory and 
heart rates that end in rounded numbers when assessing a patient’s 
vital signs. 

Expectation bias: When investigators systematically measure or 
record observations that concur with their prior expectations. 

Exposure suspicion bias : When knowledge of the patient’s disease 
status influences both the intensity and outcome of a search for 
exposure. 

Faking bad bias: When patients try to appear sicker to qualify for
supports or study inclusion. 

Family history/family information bias: When knowledge of family
member exposure or disease status stimulates a search and reveals a 
new case within the family. This bias causes increased prevalence 
rates when compared to families without a stimulated search. 

Forced choice bias : When study participants are forced to choose
predetermined yes or no outcomes without a non-response or 
undecided option available. 

Framing bias : When a participant’s response is influenced by the
wording or framing of a survey question.

Hospital discharge bias : When hospital mortality and morbidity 
rates are distorted by the health status of patients being transferred 
between facilities. Some hospitals may discharge patients more fre¬ 
quently or earlier in the disease course and shift the mortality and 
morbidity to other facilities. 

Incorporation bias : When the incorporation of the disease outcome 
or aspects of diagnostic criteria into the test itself inflate the diag¬ 
nostic accuracy of a test under study. 

Indication bias : When early or preventative treatment in high-risk 
individuals is falsely credited with causation of disease. 

Insensitive measure bias : When outcome measures are incapable of 
detecting clinically significant changes or differences. Insensitive
measure biases often cause a type II error: a true
difference between groups existed, but the study was unable
to detect it.

Instrument bias : When defects in the calibration or maintenance 
of instruments cause systematic deviations in the measurement 
of values. 

Interobserver Variability bias : When multiple observers produce 
varying measurements of the same material. Different observers 
may be more likely to record differing measurements due to occur¬ 
rences such as “end-digit bias”. In order to decrease interobserver 
variability it is important to have strict measurement criteria that 
are easy to follow by a number of observers. 

Interview setting bias : When the environment in which an inter¬ 
view takes place has an influence on participant response. 

Interviewer bias : When an interviewer either subconsciously or 
consciously gathers selective data during the interview process. 

Intraobserver Variability bias : When an observer produces varying 
measurements on multiple observations of the same material. 

Juxtaposed scale bias : When different responses are given to the 
same item asked on multiple and different self-response scales. 

Laboratory data bias : When a systematic error affects all labora¬ 
tory results. This can be analyzed and confirmed by inter-laboratory 
comparisons. 

Latency bias : When the outcome is measured in a time interval that
is too brief for it to occur. 

Obsequiousness bias : When participants alter questionnaire responses 
in the direction they perceive the investigator desires. 

Observer bias : When observers have prior knowledge of participant 
intervention or exposure and are more likely to monitor outcomes 
or side effects in these participants. 

Protopathic bias: When the treatment of early or subclinical disease
symptoms erroneously appears to cause the outcome or disease. 

Positive satisfaction bias : When participants tend to give positive
answers to questions regarding satisfaction.

Proxy respondent bias: When a participant proxy, such as a family 
member, produces a different response than the study participant 
would himself or herself. 

Questionnaire bias: When survey or questionnaire format influ¬ 
ences respondents to answer in a systematic way. 

Recall bias: When a systematic error in participant questioning 
causes differences in the recall of past events or experiences between 
comparison groups. Recall bias most often occurs in case-control 
studies when questions about a particular exposure are asked sev¬ 
eral times to participants with disease but only once to controls. 

Reporting bias: When study participants selectively reveal or 
suppress information to investigators. 

Response fatigue bias: When respondents fatigue towards the end
of an intervention or survey and become biased to either not finish
or respond in a one-sided manner.

Review bias: When prior knowledge or a lack of blinding influences 
investigators to make subjective interpretations. 

Scale format bias: When response scales are formatted to have an
even or odd number of responses. Even scales force the respondent
to choose a positive or negative response. Odd scales favor neutral
responses.

Sensitive question/social desirability bias: When socially sensitive 
questions elicit socially desirable responses rather than a partici¬ 
pant’s true belief. 

Spatial bias: When spatially differing populations are presented
and/or represented as a singular entity.

Substitution game bias: When investigators substitute a risk factor 
that has not been established as causal for an associated outcome. 

Test review bias : When the interpretation of a diagnostic test is
made with knowledge of the diagnostic reference standard test 
result of the same material. 

Therapeutic personality bias : When an un-blinded investigator’s
conviction about the efficacy of a treatment influences a patient’s 
perception of the treatment benefit. 

Unacceptability bias : When measurements are systematically
refused or evaded because they embarrass or invade the privacy of 
participants. 

Underlying cause bias/rumination bias : When cases and controls
exhibit differing ability to recall prior exposures due to the time 
spent by cases ruminating about possible causes of their disease. 

Yes-saying bias : When respondents more often agree with a statement
than disagree with its opposite statement. Also used to define
when survey participants repetitively respond positively without 
reading the questions after the first few positive responses. 


7 Biases Associated with Data Analysis 

Anchoring bias/adjustment bias : When an investigator either
subconsciously or consciously adjusts the initial reference point so 
that the result may reach their estimate hypothesis. 

Data dredging bias : When investigators review the data for all
possible associations without a prior hypothesis. This “shotgun
approach” to analysis increases the risk of a statistically significant
result that has occurred by chance (a type I error).

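The inflation of type I error under data dredging can be made concrete with a small calculation. The sketch below assumes the tests are independent and that every null hypothesis is true; both are simplifying assumptions not stated in the text. The Bonferroni threshold is shown only as one common correction and is not a method taken from this chapter.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false-positive result among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise rate <= alpha
    (a common correction, used here purely for illustration)."""
    return alpha / n_tests


if __name__ == "__main__":
    # Compare the chance of a spurious finding as more associations are tested.
    for k in (1, 5, 20, 100):
        print(k, round(familywise_error_rate(0.05, k), 3),
              bonferroni_threshold(0.05, k))
```

With 20 unplanned comparisons at the 0.05 level, the chance of at least one spurious "significant" association is already close to two in three.
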
Distribution assumption bias : When the investigator inappropriately 
applies a statistical test under the incorrect assumption of the data 
distribution. Many statistical tests require that the data represent a 
“normal distribution” and should not be applied when this is not 
the case. 

Enquiry unit bias : When the choice of the unit of enquiry affects 
the analysis, results and impression. For example one could state 
that 70 % of hospitals do not offer dialysis; however, if the patient 
becomes the unit of enquiry, the applicable statement could be 
only 3 % of patients cannot receive dialysis treatment at their 
hospital. 

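The effect of the unit of enquiry can be reproduced with a toy data set. The hospital sizes below are hypothetical, chosen only so that the two proportions match the 70 % and 3 % figures in the example above.

```python
# Each tuple is (number of patients served, offers dialysis?).
# All counts are invented for illustration.
HOSPITALS = [
    (4000, True), (3000, True), (2700, True),
    (50, False), (50, False), (50, False), (50, False),
    (40, False), (30, False), (30, False),
]


def share_without_dialysis(hospitals):
    """Return (share of hospitals, share of patients) without dialysis access."""
    hospitals_without = sum(1 for _, offers in hospitals if not offers)
    patients_total = sum(patients for patients, _ in hospitals)
    patients_without = sum(patients for patients, offers in hospitals if not offers)
    return hospitals_without / len(hospitals), patients_without / patients_total


if __name__ == "__main__":
    by_hospital, by_patient = share_without_dialysis(HOSPITALS)
    print(f"hospitals without dialysis: {by_hospital:.0%}")
    print(f"patients without dialysis:  {by_patient:.0%}")
```

The same data support both statements; which one is reported depends entirely on whether the hospital or the patient is treated as the unit.
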
Estimator bias : When a large difference between an estimator’s
value and the true value of the parameter occurs. For example, the
odds ratio lies further from 1 than the relative risk, and the
discrepancy grows as the outcome becomes more common.

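A short numerical comparison shows how the odds ratio departs from the relative risk. The risks used are hypothetical; the function names are illustrative and not drawn from the text.

```python
def relative_risk(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of the outcome risk in exposed versus unexposed subjects."""
    return risk_exposed / risk_unexposed


def odds_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of the outcome odds in exposed versus unexposed subjects."""
    odds_exposed = risk_exposed / (1.0 - risk_exposed)
    odds_unexposed = risk_unexposed / (1.0 - risk_unexposed)
    return odds_exposed / odds_unexposed


if __name__ == "__main__":
    # Same relative risk of 2 in both rows, but a rare versus a common outcome.
    for exposed, unexposed in ((0.02, 0.01), (0.40, 0.20)):
        print(exposed, unexposed,
              round(relative_risk(exposed, unexposed), 3),
              round(odds_ratio(exposed, unexposed), 3))
```

When the outcome is rare the two measures nearly coincide; when it is common the odds ratio is noticeably further from 1 than the relative risk.
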
Missing data bias/handling data/data completeness bias : When
data analysis occurs after a systematic loss of data from one of the 
comparison groups. 

Multiple exposure bias : When multiple exposures are responsible for
the magnitude of an outcome without their individual effects
being weighted.

Non-random sampling bias : When statistical methods requiring 
randomly sampled groups are applied to non-random samples. 

Omitted-variable bias : When one or more important causal factors
are left out of a regression model, resulting in an overestimation or
underestimation of the other factors’ effects within the model.

Outlier handling/tidying-up bias : When the exclusion of outliers 
or unusual results from the analysis cannot be justified on statistical 
grounds. 

Overmatching bias : When comparison groups are matched by
non-confounding variables that are only associated with exposure, and
not disease.

Post-hoc significance bias : When decision levels of statistical
significance are selected during the analysis phase to show positive results.

Repeated peeks bias : When investigators continuously monitor 
ongoing results and discontinue their study when random variability 
has shown a temporarily positive result. 

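The consequence of repeated peeks can be illustrated with a small simulation. This is a sketch under stated assumptions: observations are drawn from a standard normal distribution with a true mean of zero (so any "significant" result is a false positive), the variance is treated as known, and a two-sided z-test at the 0.05 level is applied after every batch of 20 observations. The number of looks, batch size, and seed are arbitrary choices.

```python
import math
import random


def simulate_trial(rng, n_looks=10, batch=20, z_crit=1.96):
    """Simulate one study under the null hypothesis.

    Returns (significant at any look, significant at the final look)."""
    total, n = 0.0, 0
    significant_at_any_look = False
    z = 0.0
    for _ in range(n_looks):
        for _ in range(batch):
            total += rng.gauss(0.0, 1.0)
            n += 1
        z = total / math.sqrt(n)  # z-statistic for the mean, known variance 1
        if abs(z) > z_crit:
            significant_at_any_look = True
    return significant_at_any_look, abs(z) > z_crit


def false_positive_rates(n_trials=2000, seed=1):
    """Share of null studies declared positive with and without peeking."""
    rng = random.Random(seed)
    peeking = final_only = 0
    for _ in range(n_trials):
        any_look, last_look = simulate_trial(rng)
        peeking += any_look
        final_only += last_look
    return peeking / n_trials, final_only / n_trials


if __name__ == "__main__":
    print(false_positive_rates())
```

Stopping at the first nominally significant look yields a false-positive rate well above the nominal 5 % obtained from a single analysis at the planned end of the study.
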
Regression to mean bias : When random variability accounts for the
majority of a study’s significant result and follow-up investigations 
are found to have less or non-significant results. 

Scale degradation bias : When clinical data or outcomes are collapsed
into less precise scales which obscure differences between
comparison groups.

Standard population bias: When the choice of standard population
affects the estimation of standardized rates.

Treatment analysis bias/lack of intention to treat analysis: When 
study participants are analyzed based on the treatment they 
received rather than the treatment that they were allocated to dur¬ 
ing randomization. 

Verification bias/workup bias: When the selective referral, workup, 
or disease verification leads to a biased sample of patients and 
inflates the accuracy of a diagnostic test under study. 

Verification bias: When prior test results are verified by two differ¬ 
ent reference standards dependent on a positive or negative index 
test result. The reference standard used for verification of positive 
results is usually more invasive and therefore not used in patients 
unlikely to be diagnosed with disease. 


8 Biases Associated with Interpretation and Publication

Auxiliary Hypothesis bias : When an ad hoc hypothesis is added to 
compensate for anomalies not anticipated by the original hypoth¬ 
esis. Often this will be done so that the investigator can state that 
had experimental conditions been different the original theory 
would have held true. 

Assumption bias : When an audience assumes that the conclusions 
of a study are valid because they have been published or are pre¬ 
sented to them by a speaker, without confirming that statements 
are accurate and coincide with corresponding data and references. 

Cognitive dissonance bias : When an investigator states their belief
despite the contradictory evidence of their results. 

Correlation bias : When correlation is equated to causation without
sufficient evidence. Hill’s criteria are often used to show evidence for
causation and are defined by nine aspects: (1) the strength of association;
(2) the consistency of findings observed; (3) the specificity
of a specific site and disease with no other likely explanation; (4) a
temporal cause and effect; (5) a biological dose response gradient;
(6) a plausible mechanism between cause and effect; (7) coherence
of epidemiological and laboratory findings; (8) that effect and
condition can be altered by an appropriate experimental regimen;
(9) that there has been consideration of alternate explanations.

Generalization bias : When the author or reader inappropriately
infers and applies a study’s results to different or larger populations
for which the results are no longer valid.

Interpretation bias : When the investigator or reader fails to draw 
the correct interpretations from the data and results achieved. 

Magnitude bias : When the selection of the scale of measurement 
distorts the interpretation of the reader or investigator. 

Mistaken identity bias : When strategies directed toward improving
the patient’s compliance cause the treating investigator to prescribe
more vigorously and the effect of increased treatment is
misinterpreted as compliance.

Significance bias : When statistical significance is confused with
biological or clinical significance and leads to pointless studies and
useless conclusions.

Under-exhaustion bias : When the failure to exhaust the hypothesis
leads to authoritarian rather than authoritative interpretation. 


Longitudinal Studies 1: Determination of Risk

Sean W. Murphy 

Abstract 

Longitudinal or observational study designs are important methodologies to investigate potential associations
that may not be amenable to randomized controlled trials. In many cases they may be performed using existing
data and are often cost-effective ways of addressing important questions. The major disadvantage of observational
studies is the potential for bias. The absence of randomization means that one can never be certain that
unknown confounders are absent, and specific study designs have their own inherent forms of bias. Careful
study design may minimize bias. Establishing causal association based on observational methods requires due
consideration of the quality of the individual studies and knowledge of their limitations.

Key words Longitudinal studies, Cohort study, Case-control study, Bias, Risk factors, Sample size
estimate


1 Introduction 


Randomized, controlled trials (RCTs) are indisputably the gold 
standard for assessing the effectiveness of therapeutic interventions 
and provide the strongest evidence of association between a specific 
factor and an outcome. In this age of evidence-based medicine, 
“grades of evidence” based on clinical study design universally 
reserve the highest grade for research that includes at least one 
RCT. The lowest level of evidence is given to expert opinion and 
descriptive studies (e.g., case series), and observational studies are 
considered intermediate levels. 

Despite the strengths of an RCT, such a study design is not 
always feasible or appropriate. It is obvious that human subjects 
cannot ethically be randomized to exposure to a potentially nox¬ 
ious factor. In some cases, such as surgical versus medical treat¬ 
ments, alternative therapies for the same disease are so different 
that it is unlikely that patients would be indifferent to their choices 
to the degree that they will consent to randomization. Sometimes 
it is not possible to randomize exposure to a risk factor at all. 
Studies of the genetic contribution to a disease are a good example; 
subjects either have a family history of the outcome or not, and this
cannot be altered. RCTs may not be technically feasible if the outcome
is relatively rare or takes a long period of time to become
evident. In such instances the sample size will need to be large or
the follow-up period long. While it is not true that an RCT will
always cost more than an observational study, in many instances a
clinical question that requires a very large RCT may be addressed
much more cost-effectively with an alternative study design. For
these reasons and many others, non-RCT studies make up a large
proportion of the published medical literature. Many important
clinical questions will never be subjected to an RCT, and clinicians,
researchers, and policy-makers must rely on alternative study
designs to make decisions.

1.1 Non-randomized Study Designs

As a group, non-RCT study designs are referred to as non-interventional
or observational studies. The investigator does not
actually interfere with the study subjects and seeks only to ascertain
the presence or absence of either the risk factor or outcome of interest.
Some observational designs, particularly cross-sectional studies,
are best suited to determining disease prevalence and provide only
weak evidence of risk factor and disease association. The focus of
this chapter is on observational study designs aimed at providing
stronger evidence of causal relationships, including treatment effect.
These so-called longitudinal studies include case-control and cohort
designs. The strengths and weaknesses of such designs are discussed,
including the biases that may be inherent in each. The interpretation
of results and technical issues such as sample size estimation are
considered.


2 Cohort Studies


A cohort is a group of subjects, defined at a particular point in
time, that shares a common experience. This common factor may 
be a simple demographic characteristic, such as being born in the 
same year or place, but it is more frequently a characteristic that is 
considered a potential risk factor for a given disease or outcome. 
A cohort is followed forward in time and subjects are evaluated for 
the occurrence of the outcome of interest. 

Cohort studies are frequently employed to answer questions 
regarding disease prognosis. In this case, the common experience 
of the cohort is the presence of disease. In its simplest form, 
patients are enrolled in the study, followed prospectively, and the 
occurrence of specified symptoms or endpoints, such as death, is
recorded. Because comparisons are not being made, no control
group is required. The data from such a study will be purely
descriptive, but provide useful information regarding the natural
history of that illness. More commonly researchers will be
interested in making some type of comparison. In studies of
prognosis, the question may be whether or not a given factor in
patients with a disease influences the risk of an outcome such as
mortality. In this instance two or more cohorts are assembled: all
subjects have the disease of interest, but some have the risk factor
under study and others do not. This permits an analysis of the risk
attributable to that factor. The risk factor under consideration
might be a given disease treatment, and non-randomized trials of
therapeutic interventions are in fact cohort studies of prognosis.

Cohort studies are also commonly used to study the association
of risk factors and the development of an outcome. In this case
the defining characteristic of the cohort is the presence of or exposure
to a specified risk factor. A second cohort that does not have
the risk factor will serve as the control group. All subjects are followed
forward in time and observed for the occurrence of the disease
of interest. It is essential that the study and control cohorts are
as similar as possible aside from the presence of the risk factor being
investigated.

2.1 Types of Cohort Studies

Cohort studies are inherently prospective in that outcomes can
only be assessed after exposure to the risk factor. This does not
mean that researchers must necessarily begin a cohort study in the
present and prospectively collect data. Another approach is to use
existing records to define a cohort at some point in the past.
Outcomes may then be ascertained, usually from existing records
as well, but in some cases subjects or their relatives may be contacted
to collect new data. This type of design is referred to as a
“historical cohort” and is clearly useful when the outcome takes a
long period of time to develop after the exposure. Alternatively, an
“ambidirectional” approach may be used, i.e., the cohort may be
defined in the past, records used to assess outcomes to the present
day, and subjects followed prospectively into the future (Fig. 1).

2.2 Advantages of Cohort Studies

Cohort studies are often an effective way to circumvent many of
the problems that make an RCT unfeasible. Potentially harmful 
risk factors may be ethically studied, as the investigators do not 
determine the exposure status of individuals at any point. Cohort 
studies are suitable for studying rare exposures. It is often possi¬ 
ble to identify a group of individuals who have been subjected to 
an uncommon risk factor through occupational exposure or an 
accident, for instance. The historical cohort design is an excellent 
approach if there is a long latency period between exposure and 
outcome. 

Cohort studies have an advantage in that multiple risk factors 
or multiple outcomes can be studied at the same time. This can be 
easily abused, however, and it is advisable that a single primary 
outcome be identified, for which the study is appropriately 
powered, and the remaining secondary outcomes be considered
“hypothesis generating.” Depending on the mode of analysis,
cohort studies can generate a variety of risk measures, including
relative risks (RR), hazard ratios (HR), and survival curves.

[Fig. 1 Types of cohort studies: standard prospective, historical cohort, and ambidirectional designs, each showing exposure and outcome measurement relative to the present day on a time axis]

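As an illustration of the simplest of these risk measures, the sketch below computes a relative risk and an approximate 95 % confidence interval from cohort counts. The counts are hypothetical, and the interval uses the standard large-sample approximation on the log scale, which is an assumption of this example rather than a method prescribed by the chapter.

```python
import math


def relative_risk_ci(events_exposed, n_exposed, events_unexposed, n_unexposed,
                     z=1.96):
    """Relative risk with an approximate confidence interval.

    Uses the large-sample standard error of ln(RR); requires at least one
    event in each cohort."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    se_log_rr = math.sqrt(1.0 / events_exposed - 1.0 / n_exposed
                          + 1.0 / events_unexposed - 1.0 / n_unexposed)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical cohorts: 30 events among 1000 exposed subjects,
    # 15 events among 1000 unexposed subjects.
    print(relative_risk_ci(30, 1000, 15, 1000))
```

Hazard ratios and survival curves additionally require the time of each event and of each censoring, and are usually estimated with dedicated survival-analysis software.
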
In the overall hierarchy of evidence, cohort studies are consid¬ 
ered a relatively powerful design to assess the outcome risk associated 
with a given factor. Unlike a cross-sectional or case-control study, 
one can usually be certain that the temporal sequence is correct, 
i.e., exposure precedes outcome. Cohort studies are also less prone 
to the so-called survivor bias than other observational designs. 
Disease cases that are instantly or rapidly fatal will not be sampled
in a study that selects cases from a living population. If such patients 
were included in a cohort study, their deaths would be recorded. 

2.3 Disadvantages of Cohort Studies

The cohort design may be used for studying rare diseases or outcomes,
but this often necessitates a very large sample size, especially
if researchers wish to make comparisons between groups.

2.3.1 Missing Data

A standard, prospective cohort study has the advantage that investigators
can establish clear definitions of both the exposure and
outcome and then collect all necessary data on subjects as they
enroll in the study or meet endpoints. While detailed inclusion
criteria and outcome definitions are also required for historical
designs, problems may arise when the required data are ambiguous,
suspicious, or missing altogether. Differential follow-up
between compared groups may be a major problem. Losses to
follow-up, whether due to study withdrawals, unmeasured
outcomes, or unknown reasons, are always a concern. This is particularly
true when more outcome data are missing in one group
than the other, as there is no way to be certain that the factor being
studied is not somehow related to this observation.

2.3.2 Ascertainment Bias

The classification of exposed versus unexposed subjects is also
critical. One may be confident of exposure status in some cases,
e.g., using employment records to determine if an individual
worked in a particular factory. For exposures such as smoking or
dietary habits, however, reliance upon patient recollection or
records not intended to be used for research purposes may lead to
misclassification of that subject. In many cohort studies the outcome
will also be ascertained through existing records. For objective
endpoints such as death there is not likely to be any doubt as
to the time and certainty of the outcome. Outcomes such as the diagnosis
of a specific disease are potentially more problematic. Clinical
studies generally require very specific criteria to be fulfilled to diagnose
disease outcomes to ensure internal validity. The application
of such strict case definitions may not be possible based on past
records. The accuracy of administrative databases and hospital
discharge records for clinical disease diagnoses varies but may be
insufficient for research use.

2.3.3 Contamination

In comparative cohort studies, treatment crossover may occur and
is sometimes a substantial source of bias. Subjects initially unexposed
to the risk factor of interest may become exposed at a later
date. Such “contamination” will tend to reduce the observed effect 
of the risk factor. If the groups being compared consist of a given 
treatment versus no treatment or an alternate therapy, it is possible 
that some controls will start treatment or subjects may switch ther¬ 
apies. It is unlikely that such switches will occur for no reason, 
and it is possible that subjects not doing well with their current 
treatment will be the ones changing groups. This will tend to 
reduce the observed effect and bias the study towards the null 
hypothesis. Contamination of groups in a cohort study is best dealt 
with by prevention. This may not be possible, however, and once 
it has occurred it is not easily addressed in the analysis phase. The 
conventional approach is to analyze the data in an intention-to- 
treat fashion, i.e., subjects are analyzed in the group to which they 
were originally assigned, regardless of what occurred after that 
point. This method is the only way to eliminate the potential bias 
that the reason for treatment crossover, or loss to follow-up for 
that matter, is somehow associated with the outcome [1]. Given
that it reduces observed risk, however, some researchers choose to
additionally analyze in a “treatment-as-received” manner. A number
of statistical methods, including Cox proportional hazards
modeling and Poisson regression, are suitable for this approach. In
the latter technique, rather than allowing a single subject to remain
in only one group for analysis, the amount of “person-time” each
subject represents within a study is considered. An individual may
therefore contribute person-time to one, two, or more comparison 
groups depending on whether or not they switched assigned risk 
factor groups. A major problem, however, is the assignment of 
outcomes to risk factor groups. Death, for example, can only occur 
once and is not necessarily attributable to the risk of the group that
subject was in at the time that this occurred. Outcome assignment
is therefore an arbitrary process and prone to bias in itself. When a
treatment-as-received analysis is used, it is in the best interest of the
investigator to present both this and the intention-to-treat results.
If the results from the two analyses are concordant, there is no
issue. When the two techniques give substantially different risk
estimates, however, the reasons for the discordance need to be
considered.

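The person-time idea described above can be sketched in a few lines. The follow-up episodes are invented for illustration, and each event is attributed to the group the subject was in when it occurred, which is exactly the arbitrary assignment the text cautions about. This is a crude rate calculation, not a Poisson regression.

```python
from collections import defaultdict


def incidence_rates(episodes):
    """Events per person-year in each group.

    episodes: iterable of (group, years_at_risk, events). A subject who
    switches groups contributes one episode to each group."""
    person_years = defaultdict(float)
    events = defaultdict(int)
    for group, years, n_events in episodes:
        person_years[group] += years
        events[group] += n_events
    return {group: events[group] / person_years[group] for group in person_years}


# Hypothetical follow-up data.
EPISODES = [
    ("untreated", 2.0, 0), ("treated", 3.0, 0),  # subject 1: crossed over at year 2
    ("untreated", 5.0, 1),                       # subject 2: never treated, event at year 5
    ("treated", 4.0, 1),                         # subject 3: treated throughout, event at year 4
    ("untreated", 1.0, 0), ("treated", 1.0, 0),  # subject 4: crossed over at year 1
    ("untreated", 2.0, 1),                       # subject 5: never treated, event at year 2
]

if __name__ == "__main__":
    rates = incidence_rates(EPISODES)
    print(rates, rates["treated"] / rates["untreated"])
```

Subjects 1 and 4 contribute person-time to both groups, so neither is forced into a single comparison group for the whole of follow-up.
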
2.3.4 Selection Bias

Perhaps the largest threat to the internal validity of a cohort study
is selection bias, also called case-mix bias. The major advantage of
randomization is that variables other than the factor being studied,
whether they are known to be confounders or not, are balanced
between the comparison groups. In a cohort study subjects
are assigned to exposure or control groups by themselves, the individuals
treating them, or by nature. It is possible, if not probable,
that the groups differ in other ways that have an effect on the outcome
of interest. In a large cohort study of coffee consumption
and coronary heart disease, for example, subjects who drank more 
coffee were much more likely to smoke, drank more alcohol on 
average, and were less likely to exercise or use vitamin supplements 
than subjects consuming lesser amounts [2]. These associated risk 
factors clearly have an effect on the probability of the outcome,
and provide an example of a true confounder: the exposure and
disease are not causally related, but both are associated with other
unmeasured risk factors. When confounders are suspected, they may be
quantified and controlled for in the analysis phase. Unfortunately, 
many confounders will remain unknown to the researchers and will 
bias the results in an unpredictable manner. 

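Controlling for a suspected confounder in the analysis phase can be illustrated by stratification. The counts below are invented to mimic the coffee and smoking example and are not taken from the cited study; the Mantel-Haenszel pooled relative risk is one standard adjustment method, chosen here for illustration and not named in the text.

```python
def crude_rr(strata):
    """Relative risk ignoring the stratifying variable."""
    exp_events = sum(s["exp_events"] for s in strata)
    exp_total = sum(s["exp_total"] for s in strata)
    unexp_events = sum(s["unexp_events"] for s in strata)
    unexp_total = sum(s["unexp_total"] for s in strata)
    return (exp_events / exp_total) / (unexp_events / unexp_total)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel relative risk pooled across strata."""
    numerator = denominator = 0.0
    for s in strata:
        n = s["exp_total"] + s["unexp_total"]
        numerator += s["exp_events"] * s["unexp_total"] / n
        denominator += s["unexp_events"] * s["exp_total"] / n
    return numerator / denominator


# Hypothetical cohort: heavy coffee drinkers ("exp") are mostly smokers.
STRATA = [
    # smokers: 10 % risk regardless of coffee intake
    {"exp_events": 80, "exp_total": 800, "unexp_events": 20, "unexp_total": 200},
    # non-smokers: 2 % risk regardless of coffee intake
    {"exp_events": 4, "exp_total": 200, "unexp_events": 16, "unexp_total": 800},
]

if __name__ == "__main__":
    print(round(crude_rr(STRATA), 2), round(mantel_haenszel_rr(STRATA), 2))
```

In this constructed example the crude analysis suggests that coffee more than doubles the risk, while the smoking-adjusted estimate shows no association. Adjustment of this kind is only possible for confounders that have been measured.
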
Many variations of selection bias exist. Non-participation is a 
significant problem if subjects that choose to enroll in a study are 
not representative of the target population. The resulting risk estimates
will not be an accurate reflection of the true exposure-outcome
association because of this “response bias.” Non-participants
have been shown to be older [3], have more illness [3,4], and less 
education [4, 5] than subjects who do enroll in clinical studies. 
Longitudinal genetic epidemiology studies are subject to response 
bias in often-unpredictable ways. Individuals who perceive them¬ 
selves at very low risk of a disease, such as those without a family 
history of a given illness, are less likely to enroll in studies [6]. On 
the other hand, those that are at very high risk may not participate 
due to high anxiety regarding the outcome [7]. The subjects who 
do participate, therefore, may be at more moderate risk but not 
truly representative of the general population. 

2.3.5 Bias by Indication

Bias by indication is another form of selection bias. This occurs
when subjects are more likely to be exposed to a factor because
they have a second attribute associated with the outcome. Aspirin,
for example, may be incorrectly associated with a higher mortality
risk if researchers do not account for the fact that individuals with
established heart disease are more likely to be prescribed this drug.

2.3.6 Dose-Targeting Bias

Cohort studies that are intended to compare treatment effect are
prone to dose-targeting bias, where subjects who fail to achieve a
particular target for treatment have more comorbidity than those
who do [8]. It is unlikely that multivariate adjustment for known
confounders will compensate for these differences because many of
these adverse risk factors will not be known to the investigators.


3 Designing a Cohort Study 

Once the researcher has developed a clear, concise question to 
address, the decision to use a standard prospective or a historical 
cohort design usually depends on the timelines of the hypothesized
exposure-outcome process and the resources available. A historical
cohort is only possible when researchers have access to
complete records of sufficient quality. 

As in an RCT, the investigators should decide on definitions of
both the exposure or treatment of interest and the outcome to be 
measured a priori. It is important that those reading the study have 
a clear idea of what type of patient was sampled, and whether or 
not this information can be extrapolated to their own patient 
population, i.e., to be able to gauge the study’s external validity. 

3.1 Prevalent Versus Incident Cohorts

Cohort studies that are based on the presence of a fixed characteristic
of the subjects, such as the presence of a disease, are performed
using a prevalent cohort of subjects, an incident cohort of
subjects, or a combination of the two. A prevalent cohort refers to
a group of subjects that as of a specified date have that characteristic
(e.g., all patients in a center with prostate cancer on January 1st in
a specified year). An incident cohort consists of all patients who
develop that characteristic within a specified time interval (e.g., all
patients who are diagnosed with prostate cancer in a center from
January 1st, 2013 to December 31st, 2013). A combination of the
two, referred to as a period prevalent cohort, may also be used.
When a prevalent cohort is used there is a possibility that the time
since onset of the disease or risk factor may differ significantly
between the compared groups. If the risk of the outcome increases with
time the group with the longer duration of exposure or disease will
be disadvantaged. This is referred to as “onset bias” [9]. If the risk
of the outcome is constant over time this bias may not be a factor,
but many outcomes, most notably mortality rates, often increase as
the study progresses. This bias may be present in a period prevalent
cohort, so an incident cohort design is usually optimal.
Unfortunately this option requires more time to accrue subjects
and may not be feasible for this reason.

3.2 Data Collection

It is critical that subjects be classified correctly with respect to their
exposure and outcome status. In a comparative cohort study, these
factors must be measured in precisely the same way in all subjects
to minimize the risk of the various forms of information bias. It is
not acceptable, for example, to use hospital records to identify outcomes
in the exposed group but rely on self-reporting of disease
status in controls. Ideally, researchers determining the outcome
status should not be aware of the exposure status of individuals so
they are not influenced by this knowledge.

3.3 Confounding Variables

Careful study design can help minimize selection bias. Researchers
should start by determining all known confounders that may be
relevant to their proposed study. A careful review of the literature
and consideration of plausible confounding associations may be all
that one can do, but this will often result in variables that can be
measured and accounted for in the analysis phase.

The anticipated presence of known confounders may be dealt
with in a number of ways in the design phase. The investigators
may choose to include only subjects that are similar with respect
to that factor and exclude those that differ. This process, referred
to as restriction, is an effective way to reduce selection bias but
comes at a cost: reduced generalizability and a smaller pool of
subjects from which to recruit. A second approach is to “match”
the exposure groups. Subjects in compared groups are specifically
chosen to have the same values for confounding variables, such as
age or gender. While effective, matching does not account for
unknown confounders and can become quite difficult in larger
studies. Alternatively, the researcher may choose to stratify subjects
by the confounding factor. Subjects that differ with respect
to the confounder may all be included, but they will be analyzed
separately. Essentially post-hoc restriction, this approach may
improve external validity.

3.4 Selecting a Control Group

The optimal control group for a cohort study will be exactly the
same as the study group with the exception of the factor being 
investigated. Importantly, both groups will have the same baseline 
risk of developing the outcome prior to the exposure. In all probability
this type of control group does not exist, but the researcher
must do his or her best to approximate this situation. The control
group can be selected from internal or external sources. Internal
sources are unexposed subjects from precisely the same time or
place as the exposed subjects. For example, one might select workers
in a factory who definitely were not exposed to a given chemical
to compare to workers in the same factory who were exposed to 
this potential risk factor. This type of control group is preferred, as 
one can be fairly confident that these people were similar in many 
respects. If one could not find enough unexposed controls in that 
factory, however, researchers might have to select controls from a
similar factory elsewhere. Such externally sourced control groups
are less likely to be similar to the study patients. The least desirable
option would be population-based controls. In this case the risk of 
the outcome in study subjects is compared to average rates observed 
in the general population. This is open to many biases, such as the 
possibility that subjects who work are healthier than those who 
cannot work (i.e., healthy worker bias). 


4 Case-Control Studies 

In a cohort study, researchers select subjects who are exposed to a 
risk factor and observe for occurrence of the outcome. In contrast, 
researchers performing a case-control study will select subjects
with the outcome of interest (cases) and seek to ascertain
whether the exposure of interest has previously occurred (Fig. 2). 
The probability of exposure in the cases is compared with the 
probability of exposure in subjects who do not have the outcome 
(controls) and a risk estimate for that factor can be calculated. 
Case-control studies are inherently retrospective because knowledge 
of the outcome will always precede exposure data collection. 

Case-control designs are only used to estimate the association 
of a risk factor with an outcome. Because they are retrospective, 
these types of studies fall below cohort studies in the hierarchy of 
medical evidence. It is sometimes not easy to establish the temporal 
sequence needed to infer a causal relationship. Nevertheless, 
case-control studies are an efficient and cost-effective method to 
answer many questions and are thus widely used. 

4.1 Nested Case-Control Studies

Nested case-control studies are case-control studies performed on
subjects identified during a cohort study. To illustrate this, consider
the hypothesis that exposure to a certain trace metal may
predispose children to a specific learning disability. To estimate the
incidence of this problem, researchers may choose to enroll all
school-aged children in a community in a cohort study and periodically
test them to detect the disability. While it might be possible
to measure blood levels of the metal on all subjects, this would
likely be expensive and logistically difficult. Using a nested case-control
study, however, the researchers would do the cohort study
first and identify the subjects affected by the disability (the cases).
They could then randomly choose matched unaffected controls
from the remainder of the cohort. Blood levels of the metal may
then be collected on the cases and controls and analyzed in a manner
similar to a standard case-control study. This approach has the
advantage of limiting the need for expensive or difficult risk factor
ascertainment.

Fig. 2 Design of a case-control study

4.2 Advantages of Case-Control Studies

Case-control designs are often used when the outcome or disease
being studied is very rare or when there is a long latent period
between the exposure and the outcome. A disease with only a few
hundred cases identified worldwide, for instance, would require an
impossibly large cohort study to identify risk factors. On the other
hand, it would be entirely feasible to gather exposure data on the
known cases and analyze it from this perspective.

Case-control studies make efficient use of a small number of
cases. Additionally, many potential predictor variables can be studied
at once. Although a large number of comparisons leads to concerns
about statistical validity, this type of study can be hypothesis-generating
and quite helpful in planning more focused research.

4.3 Disadvantages of Case-Control Studies

Unlike cohort studies, case-control studies cannot be used to study
multiple outcomes. This type of study is not well suited to studying
very rare exposures and provides no estimate of disease incidence.
The major problems associated with this design, however, are
confounding variables and specific types of bias.

4.3.1 Confounding Variables

The issue of confounding variables has been discussed earlier in this
chapter. Cases and controls are not randomly assigned and it is
highly likely that any confounding variables are not equally distributed
in both groups. Most case-control studies will attempt to mitigate
the effects of known confounders by matching cases and
controls. Appropriate matching has the additional advantage of
decreasing the sample size required to find the desired difference
between the groups. On the other hand, a variable that is used to
match cases and controls can no longer be evaluated for an association
with the outcome. “Overmatching,” or making the controls
too similar to the cases, may result in a situation whereby the controls
are no longer representative of the general population. This will
result in data that underestimate the true effect of the risk factor.
For these reasons it is best to avoid matching for a variable if there is
any doubt as to its effect on the outcome.

4.3.2 Sampling Bias

Ideally the cases selected for a case-control study should be representative
of all patients with that disease. If this is not the case then
“sampling bias” may be present. This is not an unusual occurrence.
A disease such as hypertension, for example, remains undiagnosed
in the majority of people it affects; a case-control study that recruits
cases from a hospital setting is not going to be representative of the
general hypertensive population. It is often difficult to avoid this
form of bias, but the possibility that it exists should be considered
when interpreting the results.

4.3.3 Information Bias

Certain types of information bias may be problematic in case-control
studies. Because cases are usually more aware of the nature
of their disease and potential causes of it compared to the population
at large, they are more likely to remember certain exposures in their
past. This is known as “recall bias.” The effect of this bias may be
obviated by ascertaining exposure without relying on the subject’s
recollections, e.g., by using past medical records. When this is not
possible, researchers may try blinding the subjects to the study
hypothesis. Careful selection of the control group may help. In a
study of the association between family history and all-cause end-stage
kidney disease, for example, the investigators used the spouses
of dialysis patients as controls. These subjects were generally as
knowledgeable about kidney disease as their partners and therefore
equally likely to report positive family histories if present [10].

A second common form of bias in case-control studies is
“diagnostic suspicion bias.” This bias relates to the possibility that
knowledge of the presence of disease may influence the investigator’s
interpretation of exposure status. The person recording such
data should therefore be blinded as to the outcome status of the
subjects if at all possible.


5 Designing a Case-Control Study

Once an appropriate research question is postulated, the investigators
should state clear definitions of the outcome or disease of
interest and what constitutes exposure to the risk factor(s) to be
studied. The target population should be identified and the method
for recruiting cases outlined. Potential sampling bias must be
considered.

At this stage the investigators should identify potential confounding
variables. Restriction and stratification may be useful, but
known confounders are most commonly dealt with by matching
cases and controls.

5.1 Selecting a Control Group

The selection of controls is a critical step in case-control study 
design. Subjects that have a higher or lower probability of being
exposed to the risk factor of interest based on some other characteristic
should not be used. In a study of the association between
bladder cancer and smoking, for example, it would not be
appropriate to select matched controls from the cardiology ward in
the same hospital. Patients admitted with cardiac disease are more
likely to have smoked than the general population and an underestimation
of the risk is likely to occur. To strengthen the study one
might select controls from several populations, such as an inpatient
ward, an outpatient clinic, and the general population. If the risk
estimate relative to the cases is similar across the different control 
groups the conclusion is more likely to be correct. Finally, if the 
cases are drawn from a population-based registry it is a common 
practice to select controls from the general population within that 
geographic area. 


6 Power and Sample Size Estimation in Observational Studies 

6.1 Statistical Power

Sample size estimation is an important part of the design phase of
any study, and observational designs are no different. The sample
size requirement is often a major factor in determining feasibility of
a study. Obviously, the resources available and population from
which to draw recruits must be sufficient for the study to be possible.
Less obvious, perhaps, is the ability of an adequate sample size to
ensure that the study is meaningful regardless of whether the result
is positive or negative. The probability of a type I error, i.e., finding
a difference between groups when one actually does not exist, is
expressed by the parameter α. This is typically set at 0.05, and a
P-value less than this from statistical tests will be interpreted as indicating
a statistically significant difference. The probability of a type
II error, on the other hand, is the chance of not finding a difference
between groups when one actually does exist. This is expressed by
the parameter β, and 1−β is referred to as the power of a study. Power is
typically set at 0.8, meaning that there is a 20 % chance of a type II
error, or an 80 % chance of reporting a difference if one truly exists.
The higher the desired power of the study, the larger the sample 
size required. A study with too small a sample size will be poorly 
powered; if such a study does not find a difference between groups, 
it will be difficult to tell whether there is truly no difference or not. 
Underpowered studies are uninformative and journal reviewers will 
generally reject them for this reason alone. 

6.2 Factors Determining Required Sample Size

Aside from power and the threshold for statistical significance,
several other factors will influence the required sample size. Foremost
among these is the minimum difference between the comparison
groups that the researcher deems significant. The smaller this difference
is, the larger the required sample size will be. Sample size
will also be larger given a larger variance of the outcome variable.
More “noise” in the outcome will make it more difficult to distinguish
any differences.

Sample size calculations are relatively straightforward for RCTs,
and most investigators perform them. Unfortunately, sample size 
or power estimation is often not considered in the design phase of 
observational studies. Often investigators are limited by the 
amount of data available, especially in historical studies. Cohort 
studies are sometimes limited by the total number of exposures, 
and case-control studies by the total number of cases that can be 
identified. In these instances power, rather than required sample 
size, can be calculated and the utility of the study considered from 
that perspective. 


6.3 Calculating Required Sample Size for a Relative Risk or Odds Ratio

The exact approach to sample size estimation depends on the nature
of the outcome data and the analyses to be used. The sample size
formulas for relative risk (RR) and odds ratio (OR) calculation for
two groups are essentially the same as one would use for
comparing two proportions. In the simplest situation, the sample
size will be equal for both groups. This also minimizes the total
sample size required. The researcher specifies the desired OR or RR
(equivalent to the minimum detectable difference between groups)
and the expected proportion of outcomes in the control group (p1);
the proportion in the exposed group (p2) is calculated as follows:


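$$\mathrm{RR} = \frac{p_{\text{exposed}}}{p_{\text{unexposed}}} = \frac{p_2}{p_1}, \quad \text{so} \quad p_2 = \mathrm{RR} \cdot p_1$$

$$\mathrm{OR} = \frac{p_{\text{exposed}}/(1 - p_{\text{exposed}})}{p_{\text{unexposed}}/(1 - p_{\text{unexposed}})} = \frac{p_2/(1 - p_2)}{p_1/(1 - p_1)}, \quad \text{so} \quad p_2 = \frac{\mathrm{OR} \cdot p_1}{1 + p_1(\mathrm{OR} - 1)}$$
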
With p1 and p2 known, the required sample size in each group is
calculated from the standard formula for comparing two proportions,
in which Zα is the two-tailed Z value related to the null hypothesis and Zβ is
the lower one-tailed Z value related to the alternative hypothesis
[11]. These values are obtained from appropriate tables in statistical
reference books, but for the common situation where α = 0.05
and power (1 − β) = 0.80, Zα = 1.96 and Zβ = 0.84.
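
As an illustration, the calculation can be sketched in Python. This is a minimal sketch that assumes one commonly used form of the two-proportion formula, n = [Zα√(2p̄(1 − p̄)) + Zβ√(p1(1 − p1) + p2(1 − p2))]² / (p1 − p2)², where p̄ is the mean of p1 and p2; the formula given in reference [11] may differ in detail. The function names are hypothetical, and Zβ is used as a positive magnitude (0.84 for 80 % power).

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def p2_from_rr(p1, rr):
    """Expected proportion in the exposed group for a target relative risk."""
    return rr * p1


def p2_from_or(p1, odds_ratio):
    """Expected proportion in the exposed group for a target odds ratio."""
    return odds_ratio * p1 / (1 + p1 * (odds_ratio - 1))


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Subjects needed in each of two equal-sized groups (assumed formula)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-tailed, 1.96 for 0.05
    z_beta = STD_NORMAL.inv_cdf(power)           # magnitude, 0.84 for 0.80
    p_bar = (p1 + p2) / 2
    root = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(root ** 2 / (p1 - p2) ** 2)


def power_for_n(p1, p2, n, alpha=0.05):
    """Approximate power when the group size n is fixed by the available data."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    z = ((abs(p1 - p2) * sqrt(n) - z_alpha * sqrt(2 * p_bar * (1 - p_bar)))
         / sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return STD_NORMAL.cdf(z)


if __name__ == "__main__":
    p1 = 0.10                    # assumed outcome proportion in controls
    p2 = p2_from_rr(p1, rr=2.0)  # minimum relative risk worth detecting
    n = n_per_group(p1, p2)
    print(f"p2 = {p2:.2f}, n per group = {n}, "
          f"power = {power_for_n(p1, p2, n):.2f}")
```

With a control-group proportion of 0.10 and a target RR of 2.0, this sketch calls for roughly 200 subjects per group. The second function covers the situation in which the number of available subjects is fixed and power, rather than sample size, is the quantity of interest.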


6.4 Calculating Sample Size for a Log-Rank Test

Studies that produce time-to-event outcome data from two or
more groups are generally analyzed using survival methods (discussed
later in this chapter). If the Kaplan-Meier method is used,
statistical comparisons of groups are made with the log-rank test.
The general principles of power and sample size estimation for this
method are no different from those discussed above. Again, for
maximum simplicity and power one may consider the case where
two groups are to be compared with an equal number of subjects
in both. One must first estimate the total number of outcomes (d)
that must be observed:

$$d = \left(Z_{\alpha} - Z_{\beta(\text{upper})}\right)^2 \left(\frac{1 + \psi}{1 - \psi}\right)^2, \qquad \psi = \frac{\ln(\text{Survival group 2 at end of study})}{\ln(\text{Survival group 1 at end of study})}$$

Zα is the two-tailed Z value related to the null hypothesis and Zβ(upper)
is the upper one-tailed Z value of the normal distribution corresponding
to 1−β [12]. Once this is solved, the sample size (n) for
each of the groups is:

$$n = \frac{d}{2 - (\text{Survival group 1 at end of study}) - (\text{Survival group 2 at end of study})}$$
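
The two steps can be sketched in Python. This is a minimal sketch under stated assumptions: it uses the form d = (|Zα| + |Zβ|)² [(1 + ψ)/(1 − ψ)]², working with the magnitudes of the Z values so that the sign convention chosen for Zβ does not matter, and the function names are hypothetical. Both survival proportions must lie strictly between 0 and 1 and must differ.

```python
from math import ceil, log
from statistics import NormalDist


def logrank_events(s1, s2, alpha=0.05, power=0.80):
    """Total number of outcomes (d) needed across both groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed magnitude
    z_beta = NormalDist().inv_cdf(power)           # one-tailed magnitude
    psi = log(s2) / log(s1)  # ratio of log survival at end of study
    return (z_alpha + z_beta) ** 2 * ((1 + psi) / (1 - psi)) ** 2


def logrank_n_per_group(s1, s2, alpha=0.05, power=0.80):
    """Subjects per group, given expected survival at end of study."""
    d = logrank_events(s1, s2, alpha, power)
    return ceil(d / (2 - s1 - s2))


if __name__ == "__main__":
    # Assumed survival at end of study: 50 % in group 1, 70 % in group 2.
    print(ceil(logrank_events(0.5, 0.7)), logrank_n_per_group(0.5, 0.7))
```

The survival proportions in the example are illustrative only. In practice they would come from pilot data or the literature, and the result should be checked against dedicated sample size software.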


6.5 Calculating Sample Size for a Cox Proportional Hazards Model

Methods have been established to estimate power and sample size
for studies to be analyzed using the Cox Proportional Hazards
method. The researcher must specify α, subject accrual time, anticipated
follow-up interval, and the median time to failure in the
group with the smallest risk of the outcome [13]. The minimum 
hazard ratio (HR) detectable is then used to calculate either the 
power or the estimated sample size. It is advisable to make use of 
any of a number of software packages available to perform these 
relatively complex calculations [14, 15]. 


7 Analyzing Longitudinal Studies 

7.1 Identifying Confounders

Once the data is collected the researcher should analyze information
regarding the distribution of potential confounders in the
groups. Very often the first table presented in a paper is the baseline
characteristics of the study groups. The most common
approach to identifying differences is to use simple tests (χ² or
t-tests) to identify statistically significant differences. The results of
such comparisons are highly sensitive to sample size, however. Very
large studies may result in statistically significant but clinically
unimportant differences, and small studies may not demonstrate
significant P-values when differences exist. An alternative approach
is to present standardized differences, an indicator that is less
affected by study size. The difference between the groups is divided
by the pooled standard deviation of the two groups. Standardized
differences greater than 0.1 are usually interpreted as indicating a
meaningful difference [16].
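
A standardized difference is simple to compute by hand or in a few lines of Python. The sketch below is illustrative only: it assumes the pooled standard deviation is taken as the square root of the average of the two group variances, which is one common convention, and the function names are hypothetical.

```python
from math import sqrt


def standardized_difference(mean1, sd1, mean2, sd2):
    """Standardized difference for a continuous baseline variable."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)  # assumed pooling convention
    return (mean1 - mean2) / pooled_sd


def standardized_difference_binary(p1, p2):
    """Standardized difference for a dichotomous baseline variable."""
    pooled_sd = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd


if __name__ == "__main__":
    # Hypothetical baseline data: mean age 60 vs 58 years, SD 10 in both.
    print(round(standardized_difference(60, 10, 58, 10), 2))
    # Hypothetical prevalence of diabetes: 30 % vs 20 %.
    print(round(standardized_difference_binary(0.30, 0.20), 2))
```

Both hypothetical comparisons exceed 0.1 in absolute value and would therefore be flagged as potentially meaningful imbalances, regardless of how many subjects were enrolled.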
7.2 Calculating an Estimate of Risk

The desired result of a longitudinal study is most often a quantitative
estimate of the risk of developing an outcome given the presence
of a specific factor. The method used to generate this estimate
depends on three factors: the study design, the nature of the outcome
data and the need to control for confounding variables.

7.2.1 Simple Hypothesis Testing for Risk Estimates

In the simplest situation, the outcome data for a cohort study
will be the incidence or number of outcomes in each of the compared
groups, and for a case-control study the number of exposed
individuals in the cases and controls. If confounders are either not
known to be present or have been minimized through restriction or
matching there is no need to adjust for them through statistical
means. The appropriate analysis for a cohort study in this situation is
the RR, defined as Incidence(exposed)/Incidence(unexposed). This is easily
calculated from a 2 × 2 table (Table 1). A RR greater than 1.0 implies
an increased risk, whereas a RR less than 1 is interpreted as demonstrating
a lower risk of the outcome if the factor is present. A RR
of 0.6, for example, implies a 40 % protective effect, whereas a RR of
1.4 indicates a 40 % increased risk in exposed individuals.

The measure of risk resulting from a case-control study is an OR,
the odds of having been exposed in cases relative to controls. The RR
cannot be calculated in a case-control study because the researcher
determines the number of controls. The OR is generally an accurate
approximation of the RR as long as the number of subjects with
the outcome is small compared with the number of people without
the disease. The calculation of the OR is shown in Table 1.

The most common null hypothesis to be tested for OR or RR is
that either parameter is equal to 1, i.e., there is no difference
between the compared groups. The traditional approach to direct
hypothesis testing is based on confidence intervals (CI). The upper
and lower bounds of the CI are calculated based on a specified
value of α. This is usually set at 0.05, and the resulting 95 % CI is
interpreted as containing the true value of the risk estimate
with 95 % confidence. The CI is valuable in itself as a measure of
the precision of the estimate. A wide CI indicates low precision.
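
These calculations can be sketched in Python for a single 2 × 2 table. In this minimal sketch, a and b are the exposed subjects with and without the outcome, and c and d are the unexposed subjects with and without the outcome. The confidence limits use the usual large-sample approximation on the logarithmic scale. The function names and the counts in the example are hypothetical, and every cell is assumed to be non-zero.

```python
from math import exp, log, sqrt


def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk with an approximate 95 % CI (log-scale method)."""
    rr = (a / (a + b)) / (c / (c + d))
    se = sqrt((1 - a / (a + b)) / a + (1 - c / (c + d)) / c)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95 % CI (log-scale method)."""
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)


if __name__ == "__main__":
    # Hypothetical cohort: 30 of 100 exposed and 15 of 100 unexposed
    # subjects develop the outcome.
    for label, result in (("RR", relative_risk_ci(30, 70, 15, 85)),
                          ("OR", odds_ratio_ci(30, 70, 15, 85))):
        estimate, lower, upper = result
        print(f"{label} = {estimate:.2f} (95 % CI {lower:.2f} to {upper:.2f})")
```

In this example neither interval includes 1.0, so the null hypothesis of no association would be rejected at the 0.05 level. The OR is somewhat larger than the RR here because the outcome is not rare.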

Table 1
Calculation of relative risk and odds ratios

                     Disease/outcome
                     +        −
Risk factor   +      a        b
              −      c        d

Relative Risk (RR) = [a/(a + b)] / [c/(c + d)]
Odds Ratio (OR) = ad/bc

Table 2
Calculation of 95 % confidence intervals (95 % CI) for relative risks and
odds ratios. See Table 1 for parameter definitions

For a Relative Risk:

$$95\%\ \mathrm{CI} = \exp\left(\ln(\mathrm{RR}) \pm 1.96\sqrt{\frac{1 - a/(a+b)}{a} + \frac{1 - c/(c+d)}{c}}\right)$$

For an Odds Ratio:

$$95\%\ \mathrm{CI} = \exp\left(\ln(\mathrm{OR}) \pm 1.96\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right)$$

The calculation of the CI for an OR or RR is complicated by the
fact that ratios are not normally distributed. Natural logarithms of
the ratios are approximately normally distributed, however, and are thus used in
the computation (Table 2). Once the 95 % CI is known, testing the
null hypothesis is simple. If the 95 % CI does not include 1.0, it can
be concluded that the null hypothesis can be rejected at a significance
level of 0.05. An alternate approach to hypothesis testing is
to apply a χ² test to the 2 × 2 contingency table used to compute the
RR or OR [12].

7.2.2 Controlling for Confounders

When confounding variables are present the researcher will normally 
want to minimize their effect on the risk estimate. Simple RR or OR 
calculation will not be sufficient. In the analysis phase, the investigator
has two methods available to accomplish this: stratification
and regression.

Stratification has been discussed earlier in this chapter. In either 
a cohort or case-control design, study groups are subdivided by 
values of the confounding variable and analyzed separately. Crude, 
or unadjusted, risk estimates are computed for each stratum. The
Mantel-Haenszel technique produces an “adjusted” summary statistic
for the combined strata, which are weighted according to
their sample size [17, 18]. If the adjusted risk estimate differs from
the crude one, confounding is likely present and the adjusted value 
is the more reliable of the two. 
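
The comparison of crude and adjusted estimates can be sketched in Python. This minimal sketch assumes the standard Mantel-Haenszel estimator for a pooled odds ratio, Σ(a·d/n) / Σ(b·c/n) summed over strata, where each stratum is a 2 × 2 table (a, b, c, d) laid out as exposed with and without the outcome, then unexposed with and without the outcome. The function names and counts are hypothetical.

```python
def crude_odds_ratio(strata):
    """Odds ratio from the single table obtained by collapsing all strata."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata of a confounder."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d        # stratum size acts as the weight
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical data stratified by a dichotomous confounder
    # (for example, non-smokers and smokers).
    strata = [(10, 20, 5, 40), (30, 10, 20, 15)]
    print(f"crude OR    = {crude_odds_ratio(strata):.2f}")
    print(f"adjusted OR = {mantel_haenszel_or(strata):.2f}")
```

In these hypothetical data the crude and adjusted estimates differ only slightly. A larger gap between the two would suggest more substantial confounding by the stratifying variable.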

Regression refers to any of a large number of statistical techniques
used to describe the relationship between one or more predictor
variables and an outcome. It is the ability to simultaneously
analyze the impact of multiple variables that makes these techniques valuable
for dealing with confounders. The researcher can not only “adjust”
for the confounding variables but can also quantify their effect. The
appropriate type of regression is primarily determined by the nature
of the outcome data. Continuous variable outcomes, such as height
or weight, are generally analyzed by multiple linear regression.
Although this type of analysis will produce an equation allowing
one to predict the impact of a change in the predictor variable on
the outcome, linear regression does not produce an estimate of
risk. A more common type of outcome data in observational studies
is a dichotomous variable, i.e., one that may assume only one of
two values. Either the outcome occurred (such as the diagnosis of
a disease or death) or it did not. This type of data may be analyzed
using logistic regression. Although limited to dichotomous outcomes,
logistic regression has the advantage of producing a risk
estimate (the odds ratio, interpreted in the same way as the OR described above)
and a confidence interval. Multiple predictor variables can be analyzed,
adjusting for confounders.

7.3 Analyzing Survival Data

Many cohort studies produce time-to-event or survival outcome 
data. This type of outcome contains far more information than the 
simple occurrence of an outcome. Survival data records when an 
event happened, allowing the researchers to learn much more 
about the natural history of a disease, for example, and the potential
impact of risk factors on how soon an outcome occurs.

Time-to-event data is analyzed by any one of a number of survival
techniques. Life table analysis, essentially a table of cumulative
survival over time, is not a widely used method because it requires 
all subjects to reach the specified outcome. If the outcome is death, 
for example, all subjects must die. It is a descriptive technique that 
does not allow for comparisons between groups and produces no 
risk estimate. A more common situation is that not all subjects 
reach the endpoint by the time the observational study ends. Some 
subjects may drop out, some may be subject to competing risks, 
and some will simply not reach the endpoint at the end of the follow-up
period. These subjects will be censored, i.e., the data will
be incomplete at study end and it will not be known when, or if, 
they reach the specified outcome. The fact that these subjects had 
not reached the endpoint up until a specified point is still very 
informative. A technique that makes use of censored data is the 
Kaplan-Meier method [19]. This method will generate graphical 
representations of cumulative survival over time, i.e., survival 
curves, for one or more groups. Although the overall survival of 
the different groups may be compared using the Log-Rank test, 
the Kaplan-Meier method does not produce a risk estimate. 
Confounders can be analyzed by stratification, but creating many 
groups quickly becomes cumbersome and difficult to interpret. 
The commonest multivariate form of survival analysis is based on
the Cox Proportional Hazards Model [20]. This very powerful
technique allows researchers to control for the effect of multiple
predictor variables and generate “adjusted” survival curves. It generates
a risk estimate analogous to the RR, the Hazard Ratio, and
corresponding confidence intervals. This technique is widely used
in the medical literature. It is discussed in more depth in a later
chapter.
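
To make the handling of censored observations concrete, the product-limit (Kaplan-Meier) estimate can be sketched in a few lines of Python. This is an illustrative sketch rather than a substitute for statistical software: the function name and data are hypothetical, and an event flag of 1 means the outcome was observed while 0 means the subject was censored at that time.

```python
def kaplan_meier(times, events):
    """Return (time, cumulative survival) pairs at each observed event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        outcomes = 0   # subjects reaching the endpoint at time t
        leaving = 0    # all subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            outcomes += data[i][1]
            leaving += 1
            i += 1
        if outcomes:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - outcomes / at_risk
            curve.append((t, survival))
        # Censored subjects count as at risk until they leave.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Five hypothetical subjects followed for up to 5 years.
    follow_up = [1, 2, 3, 4, 5]
    reached_endpoint = [1, 0, 1, 1, 0]
    for t, s in kaplan_meier(follow_up, reached_endpoint):
        print(f"year {t}: cumulative survival {s:.3f}")
```

The subject censored at year 2 still contributes to the denominator at year 1, which is how the method uses the information that this subject had not reached the endpoint up to that point.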
7.4 Limitations of Multivariate Techniques

It should be kept in mind that multivariate regression techniques,
while extremely useful, still have limitations. Statistical power may
be an issue in many studies. In general, the number of predictor
variables that may be included in a regression model is a function of
the sample size. Using the common rule of thumb that there must
be 10 outcomes observed for every variable included in the model,
one must have 100 outcomes to analyze the effect of 10 predictor
variables. Using too many predictor variables will result in an underpowered
analysis incapable of detecting true differences.

It is also important to understand that regression and survival 
methods all have certain assumptions that must be met to ensure 
the validity of the results. A detailed discussion of these requirements
is beyond the scope of this chapter, but the researcher must
understand the assumptions and the methods of assessing them for
the analytic technique they are using.

Chapter 5


Longitudinal Studies 2: Modeling Data 
Using Multivariate Analysis 

Pietro Ravani, Brendan J. Barrett, and Patrick S. Parfrey 

Abstract 

Statistical models are used to study the relationship between exposure and disease while accounting for the
potential role of other factors that impact outcomes. This adjustment is useful to obtain unbiased estimates
of true effects or to predict future outcomes. Statistical models include a systematic and an error
component. The systematic component explains the variability of the response variable as a function of the
predictors and is summarized in the effect estimates (model coefficients). The error element of the model
represents the variability in the data unexplained by the model and is used to build measures of precision
around the point estimates (confidence intervals).

Key words Statistical models, Regression methods, Multivariable analysis, Effect estimates, Estimate 
precision, Confounding, Interaction 


1 Introduction 


Longitudinal data contain information on disease-related factors
and outcome measures. Clinical researchers use statistical models
to test whether an association exists between one of these outcome
measures and some exposure, such as a risk factor for disease or an
intervention to improve prognosis.

The present chapter provides introductory notes on general
principles of statistical modeling. These concepts are useful to
understand how regression techniques quantify the effect of the exposure
of interest while accounting for other prognostic variables.


2 Principles of Regression and Modeling 

2.1 Role of Statistics

The task of statistics in the analysis of epidemiological data is to
distinguish between chance findings (random error) and the results
that may be replicated upon repetition of the study (systematic
component). For example, if a relationship between blood
pressure levels and left ventricular mass values exists, the response
(left ventricular mass) is expected to change by a certain amount as
blood pressure changes. In the Multiethnic Study of Atherosclerosis,
a large-scale multicenter application of cardiac magnetic resonance
in people without clinical cardiovascular disease, left ventricular
mass was found to be on average 9.6 g greater (95 % confidence
interval 8.5 to 10.8) for each standard deviation (21 mmHg)
increase in systolic blood pressure [1]. A statistical model was
used to quantify both the average change in left ventricular mass
per unit change in systolic blood pressure (systematic component)
and the variability of the observed values unexplained by the model
(random error, summarized by the confidence interval).

The estimated effect (left ventricular mass increase) attributed
to a factor (greater systolic blood pressure) is considered valid
(close to the true left ventricular mass change per blood pressure
change) if all sources of sampling and measurement bias have been
adequately identified and their consequences successfully
prevented and controlled. In fact, statistics only conveys the effect of
the chance element in the data but can neither identify nor reduce
systematic errors (bias) in the study design. The only bias that can
be controlled for during statistical analyses is “measured”
confounding. Finally, the estimated effect is unbiased if the statistical
tool is appropriate for the data. This includes the choice of the
correct function and regression technique.

2.2 Concept of Function

Most clinical research can be simplified as an assessment of an
exposure-response relationship. The former is also called input (X,
independent variable or predictor) and the latter output (Y,
dependent variable or outcome). For example, if left ventricular mass is
the response variable and the study hypothesis is that its values
depend on body mass index, smoking habit, diabetes, and systolic
blood pressure [1], then the value of left ventricular mass ($y$) is said
to be a “function of” these four variables ($x_1$, $x_2$, $x_3$, and $x_4$).
Therefore, the term “function” (or equation) implies a link
existing between inputs and output.

A function can be thought of as a “machine” transforming
some ingredients (inputs) into a final product (output). Technically
the ingredients (term or expression on which a function operates)
are called the “argument” of that function. Just as any machine
produces a specific output and has a typical shape and its own
characteristics, similarly any function has its specific response
variable and has a typical mathematical form and graphical shape. As
in the above example, more than one input variable may take part
in the function, or if you prefer, more “ingredients” may be used
to make a final product. This is important when studying
simultaneously the relative and independent effects of several factors on
the same response and during control for confounding.
[Figure 1: popular functions of LP, plotted against LP: exponential, f(LP) = exp(LP); identity, f(LP) = I(LP) = LP; logistic, f(LP) = exp(LP) / (1 + exp(LP)).]

Fig. 1 Example of three common functions of the linear predictor (LP): the identity function does not change
the LP (linear model); the exponential function is the exponentiated LP (Poisson model); the logistic function is
a sigmoid function of LP (logistic model). Note that different functions have not only different shapes but also
different ranges. With permission Ravani et al., Nephrol Dial Transplant [12]


A very important “ingredient” to define is the “linear
predictor” (LP), which is the “argument” of the statistical functions that
will be discussed in the following chapter. LP contains one or more
inputs (the “Xs”) combined in linear fashion. Figure 1 shows three
important functions of the LP: the identity function, which does
not modify its argument and gives LP as output; the exponential
function of LP; and the logistic function of LP. The underlying
mathematical structure is not important here. However, two
aspects should be noted: first, different transformations change the
“shape” of the relationship between LP and its function; second,
although the LP can range from $-\infty$ to $+\infty$ (allowing any type
of variable to be accommodated into it), its function can be
constrained into a range between 0 and 1 (logistic function); can have
a lower limit of 0 (exponential function); or can just have the same
range as the LP (identity function).
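
The different ranges of the three functions can be checked numerically. The following is a minimal sketch using only the formulas given in Fig. 1; the function names and the three example values of LP are assumptions chosen for illustration.

```python
import math


def identity(lp: float) -> float:
    """Identity function: returns the LP unchanged (linear model)."""
    return lp


def exponential(lp: float) -> float:
    """Exponential function: exp(LP), always greater than 0 (Poisson model)."""
    return math.exp(lp)


def logistic(lp: float) -> float:
    """Logistic function: exp(LP) / (1 + exp(LP)), always between 0 and 1."""
    return math.exp(lp) / (1.0 + math.exp(lp))


# Evaluate each function at a negative, a zero, and a positive LP.
for lp in (-5.0, 0.0, 5.0):
    print(lp, identity(lp), exponential(lp), logistic(lp))
```

Whatever the sign of the LP, the exponential output stays above 0 and the logistic output stays between 0 and 1, whereas the identity output takes the sign and size of the LP itself.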
2.3 Regression Methods

2.3.1 Estimation Purpose

Regression strategies are commonly applied in practice. Doctors
measure blood pressure several times and take the average in new
patients before diagnosing hypertension. Doctors do the same
when further checks provide unexpected values. Averages are
regarded as more reliable than any single observation. Also in
biostatistics the term regression implies the tendency toward an
average value. Indeed regression methods are used to quantify the
relationship between two measured variables. For example, if there
is a linear relationship between age and 5-year mortality, the
average change in output (mortality) per unit change in one (age;
univariable regression) or more than one input (age and blood
pressure; multivariable regression) can be estimated using linear
regression. This estimation task is accomplished by assigning
specific values to some elements (unknowns) of the specific regression
function. These elements are called parameters and the values
assigned to them by the regression procedure are called parameter
estimates. In the above example, the linear function of mortality has
the following mathematical form:
$\text{mortality} = LP + \varepsilon = \beta_0 + \beta_{age} \times \text{age} + \varepsilon$,
where LP is a linear function of the form $\beta_0 + \beta_x \times x$ and
$\varepsilon$ is the variability in the data unexplained by the
model. In this simple (univariable) expression there are two
“unknowns” to estimate: $\beta_0$, representing the intercept of the line
describing the linear relationship between the independent
variable (age) and the response (mortality); and $\beta_x$, representing the
average change of the response (mortality) per unit change of the
independent variable (age).

2.3.2 Meaning of the Model Parameters

The intercept ($\beta_0$) is the average value of the response variable “y”
(mortality for example) when the independent variable in the
model is zero. This makes sense when the independent variable
can be zero (for example, when diabetes is coded 0 in
nondiabetics and 1 in diabetics). When the independent variable is
continuous (age, blood pressure or body weight, for example) the
intercept only makes sense if the continuous variables in the model
are recoded using the deviates from their means. For example, if
the average age in the sample is 50, then the value of age for a
40-year-old subject becomes -10, for a 52-year-old subject it is 2,
etc. The model of 5-year mortality is the same after this recoding
but the intercept of the model now is the value of the response
(mortality) when the input (or inputs), such as age or other
continuous variables, is (are) zero, that is, at the sample mean.
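
The effect of recoding a continuous input as deviates from its mean can be illustrated with a small sketch. The five ages and 5-year risks below are invented for illustration (they are not data from the chapter), and `ols_simple` is a hypothetical helper that applies the usual least squares formulas for a single predictor.

```python
def ols_simple(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical sample: mean age is 50 years.
ages = [40, 45, 50, 55, 60]
risk = [0.04, 0.06, 0.05, 0.07, 0.08]

b0_raw, b_age_raw = ols_simple(ages, risk)

# Recode age as the deviate from the sample mean (40 becomes -10, etc.).
mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]
b0_centered, b_age_centered = ols_simple(centered, risk)

print("raw:     ", b0_raw, b_age_raw)
print("centered:", b0_centered, b_age_centered)
```

The slope is the same in both fits. Only the intercept changes: before recoding it is the fitted risk at age 0 (an extrapolation far outside the data), and after recoding it is the fitted risk at the mean age, which here equals the mean observed risk.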

Linear regression allows estimating the term “$\beta_{age}$” (as well
as further $\beta$s if more inputs were modeled), which are the regression
parameters (or population characteristics). In the above example the
most important coefficient is the regression coefficient of age because
this coefficient estimates the hypothesized effect of age on mortality.
In fact the regression parameters explain how the response variable
changes as the predictor(s) change.
[Figure 2: scatter plot of observed values ($y$) against exposure ($x$), with the line of fitted values ($\hat{y}$) and the deviates or residuals ($y - \hat{y}$).]

Fig. 2 Ordinary least squares method: the regression line drawn through a scatter-plot of two variables is “the
best fitting line” of the response. In fact this line is as close to the points as possible, providing the “least sum
of square” deviates or residuals (vertical dashed lines). These discrepancies are the differences between each
observation (“$y$”) and the fitted value corresponding to a given exposure value ($\hat{y}$). With permission Ravani
et al., Nephrol Dial Transplant [12]

2.3.3 Estimation Methods

There are different methods to estimate the equation parameters
of a regression model. The method commonly used in linear
regression, for example, is the Ordinary Least Squares (OLS)
method. In lay words this method chooses the values of the
function parameters ($\beta_0$, $\beta_{age}$) that minimize the sum of the
squared distances between the observed values of the response y and
their mean at each value of x (thus minimizing “$\varepsilon$”). Graphically
this corresponds to finding a
line on the Cartesian axes passing through the observed points and
minimizing their distance from the line of the average values
(Fig. 2). This line (the LP) is called the line of fitted (expected)
values toward which the observed measures are “regressed.” Other
estimation methods exist for other types of data, the most
important of which is Maximum Likelihood Estimation (MLE). MLE, as
opposed to OLS, works well for both normally (Gaussian) and
non-normally distributed responses (for example, Binomial or
Poisson). However, both MLE and OLS choose the most likely
values of the parameters given the available data, those that
minimize the amount of error or difference between what is observed
and what is expected. For example, given three independent
observations of age equal to 5, 6, 10, the most likely value of the
mean parameter is 7, because no other value could minimize the
residuals further (give smaller “square” deviates). As another
example, given two deaths among four subjects the MLE of the
risk parameter is 0.5 because no other value gives a larger value
of the likelihood function.
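
The first numerical example above (observations 5, 6, and 10) can be verified with a brute-force search. This is only a sketch of the least squares idea, not the way regression software works in practice; the grid of candidate values from 4.0 to 11.0 in steps of 0.1 is an arbitrary assumption.

```python
observations = [5, 6, 10]


def sum_of_squares(candidate_mean: float) -> float:
    """Sum of squared deviates of the observations from a candidate mean."""
    return sum((value - candidate_mean) ** 2 for value in observations)


# Try every candidate value between 4.0 and 11.0 in steps of 0.1.
candidates = [m / 10 for m in range(40, 111)]
best_mean = min(candidates, key=sum_of_squares)

print(best_mean, sum_of_squares(best_mean))
```

The search settles on 7, the arithmetic mean, with a sum of squared deviates of 14 (4 + 1 + 9). Any other candidate gives a larger sum.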
In complex models with several parameters (coefficients) to
estimate, the principles of calculus are applied (for all models), as it is very
difficult to make even reasonable guesses at the MLE. Importantly,
traditional MLE methods (for all models, including those for
normally distributed data) are not appropriate if outcome data are
correlated because they calculate joint probabilities of all observed
outcomes under the assumption that they are independent. For example,
given (indicated by “|”) d failures in n subjects, the likelihood of the
risk “$\pi$” is the product of the probabilities of each independent event:

$\mathrm{MLE}(\pi \mid d/n) = \pi \times (1 - \pi) \times \cdots \times (1 - \pi) = \pi^{d} \times (1 - \pi)^{n-d}$.

This is the “form” of the ML function of the “risk parameter,” which is based
on the product of independent probabilities. The most likely parameter
estimate (the value of the risk $\pi$) after observing two failures in four
subjects is 0.5 since no other value of $\pi$ (given d = 2 and n = 4) would
make the ML function of $\pi$ larger (this can be verified in this simple
example by plugging in different values for the unknown “$\pi$” given d = 2
and n = 4). However, this is valid only if each observation is independent (in
other words, if measurements are performed on unrelated individuals).

2.3.4 Likelihood and Probability

Likelihood and probability are related functions as the likelihood
of the parameters given the data is proportional to the probability
of the data given the parameters. However, when making
predictions based on solid assumptions (e.g., knowledge of the
parameters) we are interested in probabilities of certain outcomes occurring
or not occurring given those assumptions (parameters). Conversely,
when data have already been observed (the Latin word “datum”
means “given”) they are fixed. Therefore, outcome prediction is
performed estimating the probability of the data given the
parameters, whereas parameter estimation is performed maximizing the
likelihood of the parameters given the observed data. Examples of
probability functions are the Gaussian distribution of a continuous
variable (e.g., systolic blood pressure values in a population) given
some values of the parameters (e.g., mean and standard deviation);
the binomial probability distribution of the number of failures
given the parameters $\pi$ (probability of failure) and n (number of
trials); or the Poisson distribution of an event count given the
parameter (expected events). For example, assuming known values
of k parameters (effects of k independent variables), the probability
$\pi$ of an event d dependent on those k model parameters
$\theta = \beta_0, \beta_1, \beta_2, \ldots, \beta_{k-1}, \beta_k$ is $\pi(d \mid \theta)$, which is read “the probability of d given
the parameters $\theta$.” When the parameters are unknown, the
likelihood function of the parameters $\theta = \beta_0, \beta_1, \beta_2, \ldots, \beta_{k-1}, \beta_k$ can be
written as $L(\theta \mid d/n)$, which is read “the likelihood of the
parameters $\theta$ given the observed d/n.” The aim of MLE is to find the
values of the parameters $\theta = \beta_0, \beta_1, \beta_2, \ldots, \beta_{k-1}, \beta_k$ (and,
consequently, $\pi$ or other expected values such as the mean of a
quantitative variable) that make the observed data (d/n) most likely.
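
The binomial example used earlier (two failures among four independent subjects) can be checked by evaluating the likelihood over a grid of candidate risks. This is a sketch under the independence assumption stated in the text; the grid of values from 0.01 to 0.99 is an arbitrary choice, and real software maximizes the likelihood with calculus-based algorithms rather than a grid.

```python
def likelihood(pi: float, d: int, n: int) -> float:
    """Likelihood of risk pi given d failures in n independent subjects."""
    return pi ** d * (1 - pi) ** (n - d)


# Candidate values of the risk parameter: 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
best_pi = max(grid, key=lambda p: likelihood(p, d=2, n=4))

print(best_pi, likelihood(best_pi, 2, 4))
print(likelihood(0.3, 2, 4), likelihood(0.7, 2, 4))
```

The maximum is reached at a risk of 0.5, the observed proportion d/n, where the likelihood is 0.0625. Values on either side, such as 0.3 or 0.7, make the observed data less likely.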
3 Statistical Models

3.1 Meaning
and Structure


Models are representations of essential structures of objects or real
processes. For example, the earth may be approximated to a sphere
in astronomical or geographic calculations although it is rather an
oblate spheroid being flattened at the poles. Nevertheless,
depending on the objectives, the inaccuracies deriving from the
calculations carried out under the proposition of sphericity may not only
be acceptable but also advantageous with respect to precise
calculations based on the “real” shape of the planet. The example can be
extended to phenomena, such as animal experiments or
relationships among different individual characteristics. For example,
nephrologists expect that anemia worsens as kidney function
decreases. This expectation is based on the idea of a relationship
(model) between kidney function and hemoglobin concentration.
To study and describe this relationship, epidemiologists may define
a statistical model based on a certain model function and
regression method. For example, given a reasonably linear relationship,
at least within a certain range of kidney function, a linear model
may be a good choice to study anemia and kidney function data.
Indeed even in the presence of mild deviations from ideal
circumstances, the representation of a process by means of a simple model,
such as the linear model, helps grasp the intimate nature and the
mechanisms of that process. Obviously critical violations of model
assumptions would make the model inappropriate. The linear
model would be wrong if the relationship was exponential. Similarly
the sphere would not be an acceptable model if the earth were a
cone. In any case the hope is that a mathematical model can
describe the relationship of interest providing a good compromise
between appropriateness of the chosen function, simplicity,
interpretability of the results, and a small amount of residual error
(unexplained variation in the data). This error is an important component
of statistical models. In fact, biologic phenomena, as opposed to
deterministic phenomena of physics or chemistry, do not yield the
same results when repeated in the same experimental conditions
but are characterized by a considerable amount of unpredictable
variability. Therefore, probabilistic rather than deterministic
models are applied to biomedical sciences because they include
indexes of uncertainty around the population parameters estimated
using samples. These indexes allow probability statements about
how confident we can be that the estimated values correspond to
the truth.

These characteristics of statistical models are reflected in their
two major components: the systematic portion or fit and the
random or chance element. For example, the fit portion of a linear
model is a line and the errors are distributed normally with mean
equal to zero (Fig. 3). In other models the fitted portion has
different shapes and the residuals have different distributions. The
fit component explains or predicts the output variable whereas the
random component is the portion of the output data unexplained
by the model.

Fig. 3 Components of a statistical model: the statistical model of the response (Y) includes a systematic
component (SC) corresponding to the regression line (the linear predictor LP in linear regression or some
transformation of the LP in other models) and an error term ($\varepsilon$) characterized by some known distribution (for the linear
model the distribution is normal, with mean = 0 and constant variance = $\sigma^2$). With permission Ravani et al.,
Nephrol Dial Transplant [12]

In the previous example of linear regression of
mortality the amount of change in mortality per unit change in age is
the main “effect” of age. This quantity times age, added to the
intercept, gives the estimated average mortality for that age. For
example, if the estimated
parameters “$\beta_0$” and “$\beta_{age}$” of the linear model of 5-year mortality
are 0.01 (line intercept) and 0.001 (line slope), the expected risk
of death of a 60-year-old subject is 7 % in 5 years (0.01 + 0.001 × 60).
However, a 7 % risk of death in 5 years may not correspond to the
observed value for the 60-year-old subject in our data. A
60-year-old subject in the current data may not even exist, or may belong to a
category of risk of 0.1 or 0.05. The difference between what has
been recorded and the value predicted by the model is the error or
residual, and is summarized in the random portion of the model.
With biologic phenomena further information can reduce this
error, for example refining the precision of the estimate of subject
mortality, but even including several inputs into the model the
“exact” value of the response (mortality) can never be established.
In other words, some amount of variation will remain unexplained
after fitting the model to the data.
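
The arithmetic of the worked example can be written out as a short sketch. The coefficients 0.01 and 0.001 and the observed risks of 0.1 and 0.05 are those quoted in the text; the function and variable names are assumptions made for illustration.

```python
B0 = 0.01      # intercept of the 5-year mortality model (from the text)
B_AGE = 0.001  # slope: change in 5-year mortality per year of age (from the text)


def expected_risk(age: float) -> float:
    """Systematic component: fitted 5-year risk of death at a given age."""
    return B0 + B_AGE * age


fitted = expected_risk(60)

# Residual = observed value minus fitted value, for two possible observed risks.
for observed in (0.10, 0.05):
    residual = observed - fitted
    print(round(fitted, 3), observed, round(residual, 3))
```

The fitted value for a 60-year-old is 0.07 in both cases. The residual is positive (about 0.03) when the observed risk is 0.1 and negative (about -0.02) when it is 0.05, which is the part of the response left to the random component.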

3.2 Model Choice

3.2.1 General Approach

The most appropriate statistical model to fit the data depends on
the type of response variable because this determines the shape of
the relationship of interest (fit portion) and the typical distribution
of the errors (chance element). Fortunately, the forms of most
input-output relationships and error distributions are known. This
permits orientation in the choice of a model before data are
collected. Once the model has been built, its systematic and random
components are verified graphically and using formal tests based
on residuals in order to ensure that the chosen model fits the data
well. These procedures are called assumption verification and
model specification check and will not be discussed in this text.
However, principles of model choice and checks can be briefly
summarized using the linear model as an example. This model is
easy to understand because in linear regression the LP is the
argument of a function called “identity” (i.e., a function that does not
have any effect, like multiplying the argument by 1). In other
words, the observed values of the response variable are modeled as
“identity” function of the LP, plus an error term (as in the example
of 5-year mortality).

3.2.2 Form of the Exposure-Response Relationship

There are three fundamental assumptions to satisfy when using a
statistical model: (1) the relationship between the response and
the exposure must reflect the mathematical form of that model;
(2) the differences between what is observed and what is expected
must have a distribution compatible with the specific model; and
(3) these residuals must be independent.
Assumption 1 pertains to the systematic component of the model.

An example of assumption 1 is that to use the linear model
there must exist a linear relationship between the response variable
and the predictors because this is the meaning of the identity
function of LP (linear function). In geometry and elementary algebra a
linear function is a “first degree polynomial” which eventually has
the form of the LP. The coefficients of this function are real
constants (“$\beta_0$” and “$\beta$s”) and the inputs “Xs” are real variables. These
definitions may seem complex, but all they mean is that the effect of
each predictor in a linear function results in a constant change in the
output for all its values. This function is called linear because it
yields graphs that are straight lines. In other models, the response
variable is not linearly related to the LP, but the relationship can
be exponential or logistic for example. As a result, the input-output
relationship has different forms in nonlinear models as compared to
the linear model. Yet, independent of the transformation “applied”
to the LP, in the most popular statistical models it is still possible to
recognize the LP in the right hand side of the model function
(Fig. 1). The transformation implies that the change in the response,
as the exposure changes, is no longer linear but describable by other
curve shapes. Consequently, the change in the form of the
relationship (the function) corresponds to a change in the meaning of the
coefficients (parameter estimates). These coefficients remain
differences in the LP. However, the functional transformation of the LP
changes their epidemiological meaning. This is why, for example,
the exponentiated coefficients correspond to odds ratios in logistic
regression and incidence rate ratios in Poisson regression.
Importantly their meaning in the LP remains the same and the
linearity of the LP is checked in other models independent of the
specific transformation applied to the function argument. This
“shape” assumption pertains to the systematic component of all
models (Fig. 4).

[Figure 4: two scatter plots of the response against exposure (x). Left panel: linear, homoscedastic. Right panel: non-linear, heteroscedastic.]

Fig. 4 Linearity and equal variance: in the left panel the response is linearly
related to the exposure and has constant variance (homoscedasticity). In the
right panel two possible important violations are depicted: nonlinearity and unequal
variance (heteroscedasticity). With permission Ravani et al., Nephrol Dial
Transplant [12]
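
The change in meaning of a coefficient under different functions of the LP can be illustrated numerically. In this sketch the coefficient is set to log(2) purely as an assumption, and the three baseline values of the LP are arbitrary. It shows that a fixed difference in the LP translates into a constant odds ratio under the logistic function and a constant rate ratio under the exponential function.

```python
import math


def logistic(lp: float) -> float:
    """Logistic function of the LP: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-lp))


def odds(p: float) -> float:
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


beta = math.log(2)  # hypothetical coefficient: a difference of log(2) in the LP

odds_ratios = []
rate_ratios = []
for baseline_lp in (-2.0, 0.0, 1.5):
    # Logistic model: compare the odds with and without the extra beta in the LP.
    odds_ratios.append(odds(logistic(baseline_lp + beta)) / odds(logistic(baseline_lp)))
    # Poisson model: compare the expected rates with and without the extra beta.
    rate_ratios.append(math.exp(baseline_lp + beta) / math.exp(baseline_lp))

print(odds_ratios)
print(rate_ratios)
print(math.exp(beta))
```

All ratios equal exp(beta), here 2 up to rounding, whatever the baseline. The same coefficient in a linear model would instead mean a constant difference of about 0.69 units in the response.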

3.2.3 Random Component

Assumptions 2 and 3, which must be satisfied in order to use a
statistical model, pertain to the random component of the model,
i.e., the “$\varepsilon$” term in linear regression, or more generally the
difference between what is observed and what is expected
(residuals). First, the residuals must have some distribution compatible
with the specific model. For example, they must be normally
distributed around the fitted line with mean equal to zero and constant
variance in linear regression; they must follow the binomial
distribution in logistic regression and the Poisson distribution in Poisson
regression. For example, in a study of Asymmetric
Dimethylarginine (ADMA) and levels of kidney function measured as
Glomerular Filtration Rate (GFR), the observed GFR of the
sample subjects was found to be approximately symmetrically
distributed above and below the fitted line of GFR as a function of
ADMA, with equal variability along the whole line [2]. This means
that the dispersion of the observed GFR values around the fitted
line must be symmetrical (error mean = zero) and constant.
In other words, the same amount of uncertainty must be observed
for all values of ADMA (Fig. 4). Second, the residuals must
be independent, as is true for all models. This is possible only if the
observations are independent. This condition is violated if more
measures are taken on the same subjects or if there are clusters in
the data, i.e., some individuals share some experience or conditions
that make them not fully independent. For example, if some
subjects share a genetic background or belong to the same families,
schools, hospitals, or practices, the experiences within clusters may
not be independent and, consequently, the residuals around the
fitted values may not be independent. This implies that once some
measurements have been made it becomes possible to more
accurately “guess” the values of other measurements within the same
individual or cluster and the corresponding errors are no longer
due to chance alone. This final assumption must be satisfied in the
study design and, when there is correlation in the data, appropriate
statistical techniques are required.

3.2.4 Data Transformation

When the necessary conditions to use a certain model are clearly
violated, they can be carefully diagnosed and treated. For instance,
often nonlinearity and unstable variance of a continuous response
can be at least partially corrected by some mathematical
transformations of the output and/or the inputs in order to use the linear
model. This can be necessary also for the inputs of other nonlinear
models. Urinary protein excretion for example is often
log-transformed both when it is treated as output [3] and as an input
variable [4]. However, once a transformation has been chosen the
interpretation of the model parameters changes accordingly and
can become difficult to understand or explain. For these reasons,
complex transformations as well as inclusion of power terms in the
LP may be useless even if they allow the specific model
assumptions to be met. Some reports do not clearly explain the meaning
of the parameters (in terms of interpretation of the “change”) of
some complex models [3-6].

3.2.5 Meaning of the Model Assumptions

The three conditions pertaining to the shape of the relationship,
the error distribution and error independence should be imagined
in a multidimensional space if the model (as often happens) is
multivariable. They have the following meaning. Once the model
has been fitted to the data (1) it must be possible to quantify the
amount of change of the output per unit change of the input(s),
i.e., the parameter estimates are constant and apply over the whole
range of the predictors; (2) what remains to be explained around
the fit is unknown, independent of the values of the input(s); and
(3) it is independent of the process of measurement. For more detailed
discussion on applied regression the interested reader is referred to
specific texts [7].
[Figure 5: panel label recovered from the figure: Unconditional response (no input).]

Fig. 5 Information gain and residual variance: the residual variance gets progressively smaller as more
informative inputs are introduced into the model as compared to the unconditional response (distribution of the
response without any knowledge about exposure). The inputs are systolic blood pressure, SBP ($x_1$, in mmHg)
and body mass index, BMI ($x_2$, in kg/m$^2$). With permission Ravani et al., Nephrol Dial Transplant [12]


3.3 Multivariable vs. Univariable Analysis

An important purpose of multiple regression (in all models) is to
take into account more effects simultaneously, including
confounding and interaction. A graphical approach using the linear model
may help understand this meaning of multivariable analysis.

When only the response variable is considered, e.g., the overall
mean and standard deviation of left ventricular mass [1], the
largest possible variability is observed in the data (Fig. 5, unconditional
response). The variability in the output data becomes smaller if the
response is studied as a function of one input at a time or, better,
two input variables at the same time (conditional distribution of
the response). This is accomplished by multiple regression: when
more inputs are considered simultaneously the systematic
component of the model contains more information about the variability
of the response variable and the amount of error or unexplained
variability gets smaller. The intercept and residual standard error of the
model without input variables (“null model,” i.e., $y = \beta_0 + \varepsilon$) are the
parameters of the unconditional distribution of left ventricular
mass (mean and standard deviation).
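
The reduction in unexplained variability when an input is added can be seen in a small sketch. The six pairs of systolic blood pressure and left ventricular mass values below are invented for illustration and are not data from reference [1]; `fit_line` is a hypothetical helper applying the least squares formulas for one predictor.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    mean_x, mean_y = mean(x), mean(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


# Hypothetical data: SBP in mmHg and LVM in g.
sbp = [110, 120, 130, 140, 150, 160]
lvm = [130, 141, 139, 155, 158, 171]

# Null model: the only parameter is the intercept, which is the overall mean.
b0_null = mean(lvm)
ss_null = sum((yi - b0_null) ** 2 for yi in lvm)

# Model with one input: residuals are measured around the fitted line.
b0, b_sbp = fit_line(sbp, lvm)
ss_sbp = sum((yi - (b0 + b_sbp * xi)) ** 2 for xi, yi in zip(sbp, lvm))

print(b0_null, ss_null)
print(b0, b_sbp, ss_sbp)
```

The residual sum of squares around the fitted line is much smaller than the sum of squares around the overall mean, which is the numerical counterpart of the narrowing distributions in Fig. 5.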
[Figure 6: left panel, regression plane of LVM over BMI and SBP; right panel, regression planes of LVM over BMI and SBP by DM.]

Fig. 6 Three-dimensional representation of the linear model: two quantitative predictors generate a plane in
the three-dimensional space (Left Ventricular Mass, LVM, over Systolic Blood Pressure, SBP, and Body Mass
Index, BMI). The number of fitted planes increases with the number of levels of a qualitative input (e.g.,
Diabetes, DM). With permission Ravani et al., Nephrol Dial Transplant [12]


Figure 6 shows the multidimensional consequences of
introducing more inputs. With two quantitative predictors such as
systolic blood pressure and body mass index, the fitted values of left
ventricular mass lie on a plane in the three-dimensional space, the
plane that minimizes the residuals. The addition of a third
quantitative variable would create a hyper-plane in the multidimensional
space and so on. Of note, qualitative inputs, such as diabetes,
separate the fitted values on more planes, one per each level of the
independent variable. This plane would have a sigmoid (S-shaped) or some
other sophisticated shape in case other models are used to fit the
data, but the impact of multivariable analysis in the
multidimensional space has the same meaning.


4 Confounding and Interaction 

4.1 Confounding A eonfounder is an “extraneous” variable associated with both the 

outcome and the exposure without lying in the pathway connecting 
the exposure to the outcome. Conversely, a marker is only related to 
the exposure (indirect relation), whereas an intermediate variable 
explains the outcome. Finally, two inputs are eolinear when they 
carry the same or at least similar information (Fig. 7). For example, 
Heine et al. studied renal resistance indices (a marker of vascular 
[Fig. 7 panel labels: Confounding; Co-linearity; Independent; Marker; Intermediate]

Fig. 7 Possible relationships between input and outcome: a confounding factor
($X_C$) is an independent variable associated with the response ($Y_O$) and with the
main exposure of interest ($X_E$) without being in the pathological pathway between
exposure and outcome. A marker ($X_M$) is associated with the exposure only and
has no direct relationship with the outcome. Two inputs ($X_E$ and $X_F$) may also have
an independent association with the response: this is the ideal situation as it
maximizes the information in the data. Colinearity is a phenomenon whereby two
inputs ($X_1$ and $X_2$) carry (at least partially) the same information on the response.
An intermediate variable ($X_I$) lies in the pathological path leading to the outcome.
With permission Ravani et al., Nephrol Dial Transplant [12]


damage studied with ultrasound examination) in subjects with
chronic kidney disease not yet on dialysis [8]. In this study,
intima-media thickness of the carotid artery was significantly associated
with the response (renal resistance) in baseline models that did not
include age. Once age was entered into the model, intima-media
thickness lost its predictive power. The authors showed that older
patients had thicker carotid artery walls. Thus, intima-media thickness
may be a confounder, a marker, or even an intermediate variable.
In the final model of the resistance indices study, the introduction of
phosphate “lowered” the coefficient of glomerular filtration rate,
possibly because phosphate increase is one mechanism through
which kidney function reduction contributes to higher resistance
indices. However, outside an experimental context the nature of
these multiple associations can only be hypothesized.

The confounding phenomenon is of particular importance in the
analysis of non-experimental data. The way regression analysis removes
the association between the confounder and the outcome (the necessary
condition for the confounding phenomenon) is straightforward.
Consider the following model including the exposure and the
confounding factor: $y = \beta_0 + \beta_E E + \beta_C C + \varepsilon$. The difference between the
observed response and the effect of confounding left in the model
gives the effect of the exposure: $y - \beta_C C = \beta_0 + \beta_E E + \varepsilon$. The right-hand
part of the formula is now a simple regression. The same applies to
other models estimating odds or incidence rate ratios. This is the
epidemiological concept of independence: independent means purified
from other effects (including confounding). This is accomplished
by going back to simple regression, removing the effect of
extraneous variables kept in the model, and looking at how the
response changes as a function of one input at a time.

4.2 Interaction in Additive Models

4.2.1 Definition and Examples

A final issue to consider in multiple regression is the existence of an
interaction between two inputs. An interaction is a modification of
the effect of one input in the presence of the other (and vice versa).
For example, Tonelli et al. studied the effect of pravastatin on the
progression of chronic kidney disease using data from the CARE
trial [9]. They found that inflammation was associated with a higher
progression rate and pravastatin with significantly slower kidney
disease progression only in the presence of inflammation.
Inflammation modified the effect of pravastatin (and vice versa).

The interacting variables (called main terms) can be of the
same type (qualitative or quantitative) or of different types. The
interaction effect can be qualitative (antagonism) or quantitative
(synergism). For example, in the study of Kohler et al. [5] both body
mass index and HbA1c were directly related to the response
(albumin/creatinine ratio) when considered separately. However,
the coefficient of the interaction had a negative sign, indicating that
the total change of log(ALB/CR) in the presence of a one-unit
increase of both inputs (0.1535 + 0.0386) needs to be reduced by
that quantity (-0.0036). In linear models, interactions involving at
least one quantitative variable change the slope of the fitted line,
since the effects associated with quantitative variables are differences
in slope of the line. Interactions involving only qualitative
variables change the intercept of the line.

4.2.2 Modeling Confounding and Interaction

Confounding and interaction are two distinct phenomena.
Interaction is an effect modification due to the reciprocal strengthening
or weakening of two factors, whereas confounding is determined
by differential (unequal) distribution of the confounding
variable by level of the exposure and its association with the
response. All multivariable analyses allow estimation of the exposure
effect purified from confounding and interaction. A potential
confounder should be kept in the model even if not significant,
unless there are reasons not to do so. For example, a likely confounder
might be left out if the model is already too complex
for the data (there are too many parameters and few observations
or events), and the effect of the main exposure does not vary
substantially whether the confounder is included or excluded.
The amount of acceptable change in the exposure effect in the
presence or absence of the confounder in the model can be a matter
of debate and the adopted policy should be explained in the
reporting. Conversely, the interaction term is kept in the model
only if the associated effect is statistically significant (sometimes
considering a more generous P value of 0.1 for an interaction
model). The formal test for interaction is based on the creation
of a product term obtained by multiplying the two main terms,
for example inflammation and treatment group [9], or body mass
index and HbA1c [5]. Coding treatment (TRT) 0 for placebo and
1 for pravastatin, and inflammation (INF) 1 if present and 0 if
absent, the interaction term (INT) is 1 for active treatment and
inflammation and 0 otherwise. If one term is a continuous variable,
the interaction term is 0 when the qualitative variable is 0 and
equal to the continuous variable when the qualitative variable is 1.
If there are two quantitative variables, the interaction term equals
their product. Thus, the interaction model in this linear regression
is $y = \beta_0 + \beta_{TRT} TRT + \beta_{INF} INF + \beta_{INT} INT + \varepsilon$ (plus all the other variables in
the model). In the absence of a significant effect associated with
the interaction term (INT), there is no effect modification (and the
interaction term can be left out of the model). This means that the
effects of treatment and inflammation are the same across the levels
of each other. In the presence of an interaction effect, the effect
modification phenomenon must be considered in addition to the
main effects. Of note, to be interpretable the interaction must
always be tested in the presence of the main terms, since any
interaction is a difference in differences.

4.2.3 Statistical Meaning of Interaction

The formal test for the presence of interaction (introducing a
product term in the model as explained for the linear model) tests
whether there is a deviation from the underlying form of that
model. For example, if the effects of age and diabetes on some event
rate are respectively $\beta_{AGE} = 0.001$ (rate change per year of age) and
$\beta_{DM} = 0.02$ (rate change in the presence of diabetes) and there is no
interaction, then the two fitted lines corresponding to the presence
and absence of diabetes are constantly 0.02 rate units apart. The
model is called additive. Conversely, if there is an interaction effect
$\beta_{INT} = 0.001$ (further rate change per year of age in diabetics), the
two lines of the interaction model are not only 0.02 rate units
apart due to the effect of diabetes but also constantly (proportionally)
diverging by a certain amount. This amount is 2 in this example,
because $(0.001 \times AGE + 0.001 \times AGE)/(0.001 \times AGE) = 2$. In
other words, the age-related change in rate is twice as large in the
presence of diabetes at any age because diabetes modifies the effect of
age (and vice versa). This model is called multiplicative. More generally, the
response ratio is $(\beta_1 + \beta_{INT})/\beta_1$ when the interaction is between one
[Fig. 8 line labels: DM (interaction model): Rate = 0.02 + 2*(0.001*Age); DM (no interaction model): Rate = 0.02 + 0.001*Age; ND (no interaction & interaction models): Rate = 0.001*Age]

Fig. 8 Interaction parameter as a measure of the departure from the underlying form of a model: the plot shows 
two models of some event rate as a function of age and diabetes without interaction and with their interaction 
term. When diabetes is absent (ND, bottom line) the event rate is explained by age only in both models. When 
diabetes is present (DM) the fitted line of the event rate depends on age and diabetes according to the no 
interaction model (middle line) and on age, diabetes, and their product (INT) in the interaction model (top line). 
In the no interaction model the effect of diabetes consists in shifting the event rate by a certain amount quantified
by the coefficient of diabetes (change in the intercept of the line). In the interaction model the (dashed)
fitted line is not only shifted apart for the effect of diabetes but also constantly diverging from the bottom line 
(absence of diabetes). The amount of change in the slope is the effect of the interaction between age and 
diabetes and is a measure of the departure from the underlying additive form of the model. With permission 
Ravani et al., Nephrol Dial Transplant [12] 

quantitative ($x_1$) and one qualitative main term with only two values.
When both main terms are continuous the response ratio is
not constant since $\beta_{INT} \times INT$ varies as the interaction term value
varies. However, the coefficient of the interaction term is constant
and in all models $\beta_{INT}$ estimates the amount of departure from the
underlying form of the linear model.

Figure 8 shows how the interaction between age and diabetes
modifies the slope of the event rate over age. Rate estimates are
higher in diabetics by a fixed amount (effect of diabetes) as compared
to nondiabetics. This amount (coefficient) remains constant
in the no interaction model (rate as a function of age and diabetes,
continuous lines). When an interaction is present there is a constant
ratio between the two lines resulting in a slope change
(because one of the main terms is continuous) in addition to the
effect of diabetes (rate as a function of age, diabetes, and their
interaction, dashed line and bottom line).
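
The three lines of Fig. 8 can be reproduced with a short sketch that uses the coefficients quoted in the text (0.001 per year of age, 0.02 for diabetes, 0.001 for the product term). The function name `event_rate` and the ages chosen are illustrative assumptions.

```python
def event_rate(age, dm, b_age=0.001, b_dm=0.02, b_int=0.0):
    """Event rate from age, diabetes (dm coded 0/1), and their product term."""
    product_term = age * dm  # 0 in nondiabetics, equal to age in diabetics
    return b_age * age + b_dm * dm + b_int * product_term

for age in (40, 60, 80):
    nd = event_rate(age, dm=0)
    dm_no_interaction = event_rate(age, dm=1)
    dm_interaction = event_rate(age, dm=1, b_int=0.001)
    print(age, round(nd, 3), round(dm_no_interaction, 3), round(dm_interaction, 3))

# Slope per year of age in diabetics under each model
slope_no_interaction = event_rate(61, 1) - event_rate(60, 1)
slope_interaction = event_rate(61, 1, b_int=0.001) - event_rate(60, 1, b_int=0.001)
print(round(slope_no_interaction, 6), round(slope_interaction, 6))
```

Without the product term the diabetic and nondiabetic lines stay 0.02 units apart at every age; with it, the slope in diabetics is 0.002 per year, twice the slope in nondiabetics.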

4.2.4 Epidemiological Meaning of the Interaction Coefficient

The coefficient of interaction is always a difference: a difference
between differences. In fact, considering the example in Fig. 8,
there is a rate difference of 0.001 units per year of age; a difference
of 0.02 between diabetics and nondiabetics; and a further difference
of 0.001 to consider (in addition to 0.021) if a subject is 1 year
older and diabetic (total rate difference 0.022 units in a diabetic 1
year older than a nondiabetic). An interaction between two continuous
variables would change the slope of the fitted line without
affecting the model intercept. For example, if there was an interaction
between age (e.g., coefficient 0.001) and systolic blood pressure
(e.g., coefficient 0.0001) and the coefficient of interaction was
0.00001, then a subject 1 year older and with 1 mmHg higher
systolic blood pressure would have an estimated rate 0.00111
units (0.00001 + 0.0011) higher.

4.3 Interaction in Multiplicative Models

When the LP is the argument of some non-identity function,
differences in the LP assume a different epidemiological meaning.
In Cox’s, logistic, or Poisson regressions, for example, differences
in the LP, once exponentiated, have the meaning of risk ratios. As
opposed to linear models, in which the combined effect of several
factors is the sum of the effects produced by each of the factors,
Cox’s, logistic, and Poisson regressions are multiplicative models
because in these models the joint effect of two or more factors is the
product of their effects. For example, if the risk of death associated
with diabetes is twice as high as in nondiabetics and is three times as
high in men as in women, diabetic men have a risk six times as high as
nondiabetic women. However, this multiplicative effect still results
from differences in the LP.
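
A minimal sketch of this multiplicative behavior, assuming a hypothetical model whose LP contains only diabetes and sex with coefficients log(2) and log(3): the effects add on the LP scale and multiply once the LP is exponentiated. The names `b_dm`, `b_male`, and `relative_risk` are illustrative.

```python
import math

b_dm = math.log(2.0)    # hypothetical coefficient: risk ratio of 2 for diabetes
b_male = math.log(3.0)  # hypothetical coefficient: risk ratio of 3 for men

def relative_risk(dm, male):
    """Risk relative to nondiabetic women: exponential of the LP difference."""
    return math.exp(b_dm * dm + b_male * male)

print(round(relative_risk(1, 0), 6))  # diabetic women
print(round(relative_risk(0, 1), 6))  # nondiabetic men
print(round(relative_risk(1, 1), 6))  # diabetic men
```

The sum of the two coefficients, log(2) + log(3), equals log(6), so the joint ratio is the product of the two separate ratios even though no product term is in the LP.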

A problem with interaction in multiplicative models derives 
from the definition of interaction as a measure of the departure 
from the underlying form of the model. This definition meets 
both statistical and biological interpretation of the interaction 
phenomenon as an amount of effect unexplained by the main 
terms. However, when this effect is measured, its statistical and 
epidemiological interpretation differs, depending on the model
scale [10]. This is different from additive models where statistical
and epidemiological perspectives coincide: when the input
effects are measured as differences, interaction parameters are
differences chosen to measure departure from an additive model.
In additive models an antagonistic interaction will result in a
change lower than expected, i.e., less than additive [5], whereas
a synergistic interaction will result in a change greater than
expected, i.e., more than additive (Fig. 8). Statistical testing of
this departure also measures the biologic phenomenon. When
the effects are measured as ratios, interaction parameters are 
ratios, chosen to measure departures from a multiplicative 
model. In multiplicative models an antagonistic interaction will 
result in a change lower than expected (less than multiplicative), 
whereas a synergistic interaction will result in a change greater 
than expected (more than multiplicative). Statistical assessment 
of this departure tests whether there is a departure from 
Table 1

Hypothetical cardiac event data expressed as incidence rate ratio (IRR) by level of two risk
factors: smoking and hypertension, where there is no interaction on a multiplicative scale
but there is on an additive scale

                   Hypertension absent       Hypertension present
Smoking absent     1   Ref                   10  IRR 10 (1.2, 77.6)
Smoking present    5   IRR 5 (0.5, 42.5)     50  IRR?

Legend: The 2x2 table shows the number of cardiac events per 1,000 person-years by presence of hypertension and
smoking habit. As compared to subjects exposed to neither factor (absence of smoking and hypertension), the event rate
in the presence of hypertension only is ten times as high; in the presence of smoking only is five times as high; and in
the presence of both exposures is 50 times as high. On a multiplicative scale there is no interaction since there is no
departure from the underlying form of a multiplicative risk model (Poisson in this case). In fact, 5 x 10 is exactly 50.
Testing the parameter of the product term, the IRR is 1 (95 % CI 0.5, 42.5)

However, risk ratios can be assessed also on an additive scale, where the IRR is 50 (6.9, 359). The two models have the
same Log-likelihood (-355.7). The only difference is the “contrast” defining the null hypothesis. In the multiplicative
model, the interaction term is a product term assuming the value of 1 for exposed to both and 0 otherwise. The null
hypothesis is the absence of deviation from multiplicative risk (and of course it is not rejected). In the additive
formulation a factored set of terms is entered into the model with exposed to neither as reference category. The null hypothesis
is the absence of difference on an additive scale (departure from additivity). The null hypothesis is rejected because the
difference between exposed to both and exposed to neither proves to be larger than the sum of the other two differences,
i.e., (50 - 1) - [(10 - 1) + (5 - 1)] = 36. With permission Ravani et al., Nephrol Dial Transplant [12]


multiplicativity and not the existence of a biologic phenomenon.
Therefore, from the statistical viewpoint interaction depends on
how the effects are measured, although a multiplicative relationship
per se is evidence of biologic interaction as the resulting
change in the response is greater than the sum of the effects
(e.g., if diabetic men have a risk six times as high as nondiabetic
women and the relative risks associated with the main effects are
3 and 2, there is no deviation from the multiplicative scale but
there is over-additivity). On the other hand, the choice of the
model depends on the distribution of the response variable and 
cannot be dictated by the need to study interaction. However, 
there are ways to use multiplicative models and still assess the 
biological meaning of the phenomenon. 

Tables 1 and 2 show two different approaches for the analysis
of interaction in risk data using multiplicative models. For example,
in the Poisson model differences in the coefficients of the LP
are interpreted as incidence rate ratios. When the presence of more
risk factors is associated with multiplicative risk, there is no deviation
from the underlying form of the model and the formal test for
interaction is not significant. However, this should be interpreted
based on the scale of measurement and biological knowledge. Lack
of evidence of deviation from the multiplicative scale implies the
existence of over-additivity, which requires a biological explanation.
In this case biologic interaction can be assessed using categories
of covariate combination with exposed to none as reference.
Table 2

Hypothetical cardiac event data expressed as incidence rate ratio (IRR) by level of two risk factors:
smoking and hypertension, where there is antagonism on a multiplicative scale and synergism
on an additive scale

                   Hypertension absent       Hypertension present
Smoking absent     1   Ref.                  7   IRR 7 (0.8, 56)
Smoking present    3   IRR 3 (0.3, 28)       14  IRR?

Legend: The 2x2 table shows the number of cardiac events per 1,000 person-years by presence of hypertension and
smoking habit. As compared to subjects exposed to neither factor (absence of smoking and hypertension), the event rate
in the presence of hypertension only is seven times as high; in the presence of smoking only is three times as high; and
in the presence of both exposures is 14 times as high. If the risk in the interaction cell is less than multiplicative, the
estimate of the risk ratio in a multiplicative model is less than 1, giving the misleading impression of a qualitative
interaction (antagonism). The formal test for interaction gives an IRR of 0.6 (0.05, 7); using a factored set of terms the IRR is 14
(1.8, 106). The additive model supports a quantitative interaction because the number of cases in the group exposed to
both factors is larger than the sum of the two differences, i.e., 14 - 1 - [(3 - 1) + (7 - 1)] = 5. The two Poisson models have
the same Log-likelihood of -166.7. With permission Ravani et al., Nephrol Dial Transplant [12]


This method allows using any regression model to estimate
departures from additivity without imposing the multiplicative
relation implied by the structure of the chosen function [10].
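
The arithmetic behind Tables 1 and 2 can be sketched as follows, assuming the four event rates are known. The function `interaction_measures` is a hypothetical helper; it returns point values only, since the confidence intervals and tests quoted in the legends would require fitting the Poisson models to the underlying data.

```python
def interaction_measures(r00, r10, r01, r11):
    """Departure from multiplicativity and from additivity.

    r00: rate with neither exposure; r10 and r01: one exposure only;
    r11: both exposures. Rates are first divided by r00 to give ratios.
    """
    rr10, rr01, rr11 = r10 / r00, r01 / r00, r11 / r00
    multiplicative = rr11 / (rr10 * rr01)              # 1 means no departure
    additive = (rr11 - 1) - ((rr10 - 1) + (rr01 - 1))  # 0 means no departure
    return multiplicative, additive

# Events per 1,000 person-years: neither, smoking only, hypertension only, both
table1 = interaction_measures(1, 5, 10, 50)
table2 = interaction_measures(1, 3, 7, 14)
print(table1)
print(table2)
```

For Table 1 this gives a ratio of 1 and an excess of 36 on the additive scale. For Table 2 the ratio is 14/21, about 0.67 (below 1, given as 0.6 in the legend), while the additive excess is 5, so the same data look antagonistic on one scale and synergistic on the other.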


5 Reporting 

The reporting of statistical methods and results in medical literature
is often suboptimal. A few tips are summarized in this section.
More detailed checklists for reading and reporting statistical analyses
are available in textbooks [11].

5.1 Methods

In the methods section there should be a clear statement of the
study question, a detailed description of the study design, how subjects
were recruited, how the sample size was estimated to detect the
strength of the main association of interest, how all variables
(inputs and output) were measured, and how the main biases were
prevented or controlled. The main outcome of interest and the
main exposure should be clearly defined, since they determine the
choice of the regression method and model building strategies.

Number of observations as well as subjects and clusters should 
be described. This information is important to choose the proper 
statistical technique to handle correlation in the data. 

The statistical model used should be described along with the
procedures adopted to verify the underlying assumptions. The type
of function used to model the response should be clear, along with
the epidemiological meaning of the coefficients, the main exposure
and the potential confounders, why and how they appear (or not)
in the final model, and why and how possible interactions were
taken into account. As different modeling strategies may provide
different results using the same data, it is important to describe
how the model was built, and whether an algorithm was followed or a
manual approach was chosen. The regression method used to obtain the
parameter estimates and to perform hypothesis testing should also
be stated. Finally, there should be a description of the diagnostics
applied, how residuals were studied, how possible violations were
excluded, handled, or treated (e.g., with transformations), and
whether any sensitivity analyses were carried out (running the
model excluding some influential observations and checking if the
results remained the same). Some models require special checking,
e.g., the proportionality assumption for Cox’s model and fit tests
for logistic and Poisson regression. The statistical package used
for analysis should also be reported, as estimation
algorithms may differ between packages.

5.2 Results

Since any model has a systematic component and a chance element,
both should be reported. The fit component should inform about
the relationship between exposure and disease, the strength of the
associations (the coefficient estimate), and the effects of other variables
(including confounding and interactions). When interactions
are included in the model the main terms must also be included.

It is always important to understand the relevance of the effect.
Thus, reporting a statistically significant association (a P value) is
not enough. Point estimates (parameter estimates) should be
reported along with a measure of precision (95 % confidence
intervals) and the results of statistical testing (exact value of P
rather than “<0.05,” unless it is <0.001). Often measures of effect,
and not measures of disease, are of interest. For example, reporting
the average blood pressure in groups A and B (with their 95 % CI)
and the statement of a statistically significant difference is not
enough: what matters are the estimated difference and the 95 % CI
of the difference. Finally, the variability in the response unexplained
by the model is important ($1 - R^2$ statistic for the linear model,
likelihood for MLE methods) as well as the gain obtained from modeling
as compared to the unconditional distribution (which can be
appreciated from the table of patient characteristics).
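
As a minimal sketch of this recommendation, the difference between two group means can be reported together with its 95 % confidence interval. The blood pressure values below are invented, the function name is hypothetical, and the interval relies on a large-sample normal approximation; with groups this small a t-based interval would be somewhat wider.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_difference_ci(group_a, group_b, level=0.95):
    """Difference in means (a - b) with a normal-approximation confidence interval."""
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(stdev(group_a) ** 2 / len(group_a)
                   + stdev(group_b) ** 2 / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95 % interval
    return diff, diff - z * se, diff + z * se

# Invented systolic blood pressure values (mmHg)
group_a = [138, 142, 150, 135, 147, 141, 139, 145]
group_b = [130, 128, 136, 133, 127, 135, 131, 129]

diff, lower, upper = mean_difference_ci(group_a, group_b)
print(f"difference {diff:.1f} mmHg (95 % CI {lower:.1f}, {upper:.1f})")
```

Reporting the difference and its interval conveys both the size of the effect and its precision, which two separate group means with their own intervals do not.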

The next chapter examines the different multivariable models 
in more detail. 
Chapter 6


Longitudinal Studies 3: Data Modeling Using Standard 
Regression Models and Extensions 

Pietro Ravani, Brendan J. Barrett, and Patrick S. Parfrey 

Abstract 

In longitudinal studies the relationship between exposure and disease can be measured once or multiple
times while participants are monitored over time. Traditional regression techniques are used to model
outcome data when each epidemiological unit is observed once. These models include generalized linear
models for quantitative continuous, discrete, or qualitative outcome responses, and models for
time-to-event data. When data come from the same subjects or group of subjects, observations are not
independent and the underlying correlation needs to be addressed in the analysis. In these circumstances extended
models are necessary to handle complexities related to clustered data, and repeated measurements of
time-varying predictors and/or outcomes.

Key words Generalized linear models, Survival analysis, Repeated measures, Multiple failure times 


1 Introduction 


Longitudinal studies vary enormously in their size and complexity, 
and this has implications for data analysis. Open cohort studies are 
of relatively long duration and several patients can leave or join the 
cohort during follow-up. Closed cohort studies are usually shorter, 
and fewer participants leave the study during follow-up for reasons 
unrelated to the outcome of interest. 

At one extreme a large population of subjects may be studied
over years. For example, Go et al. studied the relationship between
reduced kidney function and the occurrence of hospital admissions,
cardiovascular events, and death from all causes among
1,120,295 adult members of the Kaiser Permanente Renal Registry
[1]. Both levels of kidney function (exposure) and hospital admissions
and cardiovascular events (repeatable diseases) were measured
multiple times during follow-up. The existence of the
hypothesized association between exposure and outcome was
tested taking into account these multiple measurements.


Patrick S. Parfrey and Brendan J. Barrett (eds.), Clinical Epidemiology: Practice and Methods, Methods in Molecular Biology, 
At the other extreme, some longitudinal studies follow up relatively
small groups for a few days or weeks. For example, Merten
et al. compared the effect of sodium chloride and bicarbonate in
reducing the occurrence of contrast-induced nephropathy, defined
as an increase of 25 % or more in serum creatinine within 2 days of
contrast [2]. They studied 119 subjects with stable serum creatinine
levels of at least 1.1 mg/dL (≥97.2 μmol/L) in a randomized
trial. The existence of the hypothesized effect of bicarbonate
(exposure-intervention) and reduced risk of contrast nephropathy
(disease-outcome) was tested comparing outcome measured at
study end by treatment group.

These examples introduce an important feature of longitudinal
studies: the relationship between exposure and disease can be measured
once or multiple times while participants are monitored over
time. When data come from the same subjects or group of subjects
the underlying correlation needs to be addressed in the analysis.

The present chapter provides introductory notes on traditional 
tools used to model outcome data when each epidemiological unit 
is observed once. These models include generalized linear models 
and survival models for time-to-event data. Extensions of these 
models are necessary to handle complexities related to clustered 
data, and repeated measurements of time-varying predictors and/or
outcomes.


2 Generalized Linear Models 

Generalized linear models are “parametric” models because they
estimate population characteristics (parameters) based on distributional
assumptions. These assumptions specify the shape of the
input-output relationship and the distribution of the residuals,
guiding the choice of the model to study the data (see previous
chapter).

The family of generalized linear models is large, but all its 
members have the following attributes: 

1. A specific random component defining the conditional distribution
of the response variable (Gaussian or normal for the
linear model; binomial for logistic regression; Poisson for
Poisson regression, for example).

2. A linear function of the regressors (inputs), or linear predictor
(LP), on which the expected (fitted) value $\hat{y}$ of the response
$y$ depends.

3. An “invertible” link function $g(\hat{y}) = LP$, which relates the
expectation of the response ($\hat{y}$) to the LP and whose inverse
transforms the LP back into the expectation of the response.

The first attribute pertains to the random portion of the 
models; the LP and the link function to the systematic component. 
The LP, introduced in the previous chapter, has the form 
Table 1

Some standard link functions and their inverses

Link       LP = $g(\hat{y})$                    $\hat{y} = g^{-1}(LP)$   Y range                    $Var(y \mid LP)$          Distribution   Model
Identity   $\hat{y}$                            $LP$                     $(-\infty, +\infty)$       $\sigma^2$ (constant)     Gaussian       Linear
Logit      $\log_e[\hat{y}/(1 - \hat{y})]$      $1/(1 + e^{-LP})$        $(0, 1, 2, \ldots, n)/n$   $\hat{y}(1 - \hat{y})$    Binomial       Logistic
Log        $\log_e \hat{y}$                     $e^{LP}$                 $0, 1, 2, \ldots$          $\hat{y}$                 Poisson        Poisson

LP is the linear predictor; $g$ represents an invertible link function; $\hat{y}$ is the expected value of the response; $y$ is the
observed value of the response; $Var(y \mid LP)$ is the variance of the response given the predictors. An invertible link
function is a function linking $\hat{y}$ and LP, allowing going back from LP to $\hat{y}$ using its inverse


$LP = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$. The independent variables
(the $x$s, from 1 to $k$) may be quantitative or qualitative inputs, their
transformations, polynomial terms, contrasts generated from factors,
and so on. The inputs or regressors can assume different values,
whereas the regression coefficients (the $\beta$s, from 0 to $k$) are
constant. In generalized linear models LP is the “argument” of the
specific function of that model. The standard link functions, the
inverses and the conditional error distributions of the three most
popular generalized linear models (linear, logistic, and Poisson) are
shown in Table 1. For example, linear models have an identity link
function and normally distributed errors. This means that the
observed values of the response (e.g., left ventricular mass index)
are modeled as a function of LP (including for example blood pressure
and body mass index) and what remains to be explained after
the model is fitted to the data (the error component) is normally
distributed with mean zero and some non-zero variance. It is possible
to go in the opposite direction (using the inverse) from the
fitted values (expected averages) and make predictions about future
observations taking into account the residuals. In logistic regression
a logit function links the binary data (e.g., presence of left
ventricular hypertrophy) to the LP (with one or more inputs), the
inverse function is the logistic function that allows estimating
probabilities of future outcomes from the LP, and the distribution of
the conditional response is binomial. In Poisson regression the
natural log links the expected counts (e.g., death rate or hospitalization)
to the LP, its inverse is the exponential that allows estimating
future incidence rates based on the LP, and the conditional
distribution of the response follows the Poisson distribution.
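
The three link functions of Table 1 and their inverses can be written down directly. This is only a sketch of the link and inverse-link pairs (the dictionary `LINKS` and the example LP value are assumptions made for illustration); it does not fit any model.

```python
import math

# Each entry: (link mapping the expected response to LP, inverse mapping LP back)
LINKS = {
    "identity": (lambda mu: mu, lambda lp: lp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda lp: 1 / (1 + math.exp(-lp))),
    "log": (lambda mu: math.log(mu), lambda lp: math.exp(lp)),
}

lp = -0.5  # hypothetical value of the linear predictor
for name, (link, inverse) in LINKS.items():
    expected = inverse(lp)       # expected response implied by the LP
    round_trip = link(expected)  # applying the link recovers the LP
    print(name, round(expected, 4), round(round_trip, 4))
```

The same LP value therefore translates into an expected mean, a probability between 0 and 1, or a non-negative rate, depending on which inverse link the model uses.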

The technical explanation of these attributes is complex and
beyond the scope of the present chapter. Also the Gaussian,
binomial, and Poisson distributions will not be described. However,
the important aspect to understand here is that each model has a
specific underlying function, which is invertible, and a specific
distribution of the “unexplained” variability of the response and,
consequently, a specific distribution of the conditional response
given the inputs. The type of function has implications for the
meaning of the parameters estimated by the model. The distribution
of the residuals is important to check if the model fits the data
well, i.e., to exclude the existence of different links between input
and output disregarded by the model.


2.1 General Linear Model for Quantitative Responses

2.1.1 Structure of the Linear Model

Linear regression is appropriate to model quantitative response 
variables. For example, in a cohort study of chronic kidney disease 
progression and patient survival, glomerular filtration rate (GFR) 
was inversely related to asymmetrical dimethylarginine (ADMA) at 
baseline, being on average 0.17 ml/min per 1.73 m² lower per 
0.1 µmol/L of ADMA [3]. This inverse relationship suggests that 
one variable tends to change in the opposite direction of the other, 
although only 48 % of the variability in GFR was explained by the 
systematic component of the multivariable model ($R^2$ statistic). 

The systematic component of general linear models (LP) 
includes the model intercept ($\beta_0$) and the estimated effects (e.g., 
change in GFR) associated with the predictor (e.g., ADMA) and 
other input variables in the model ($\beta_k$). The random component 
is represented by the residuals or differences between the observed 
response values and their expectations and is summarized by an 
unexplained variability of GFR as high as 52 % in the ADMA study 
[3]. Examples of different prediction ability of the same set of 
covariates used for five cardiac outcomes are described in the 
Multiethnic Study of Atherosclerosis. In this study the $R^2$ statistic 
decreased from almost 60 % for the model of left ventricular mass 
to less than 20 % for the model of left ventricular ejection fraction 
[4]. However, the most important use of the $R^2$ statistic is to 
compare so-called “nested models.” These models have the same 
response and fit the same data (the same set of observations). The 
best model (i.e., predictors to include, confounding and 
interaction terms as well as possible transformations to consider) is 
selected considering the best fit in terms of improvement of the $R^2$ 
statistic (reduction of the residual variance). 

Linear models can include one or more inputs. For example, the t 
test and one-way ANOVA are special cases of univariable linear 
models where the input has, respectively, two and more than two 
possible levels. The characteristics of the general linear model are 
summarized in Table 2. 


2.1.2 Meaning 
of the Coefficients 
in Linear Regression 


The regression coefficients of the linear model estimate the average 
change in the output per unit change in each input. Therefore, they 
are “differences” in the average response by level or unit of 
exposure (Table 2). For example, Heine et al. studied renal resistance 
indices in subjects with chronic kidney disease not yet on dialysis 
[5]. Among the independent predictors included in the final model 
of resistance indices there were age, glomerular filtration rate and 
diabetes, for example. The first coefficient estimated in the model is 
the model intercept, $\beta_0 = 50.8$ (95 % confidence interval 
42.6-59.1; P<0.001). This parameter is the value of the output when 
everything else is “zero.” For example when diabetes is coded “0” 
(diabetes absent) and “1” (diabetes present) and nothing else is in 
the model, the intercept estimates the average output value in those 
without diabetes. However, when quantitative variables (age for 
example) are in the model this quantity per se is meaningful only when 
the average values of those inputs are considered (see previous 
chapter). The P value is the probability of obtaining a coefficient at 
least as far from zero as the one observed if the null hypothesis that 
the coefficient is zero were true (tested by means of a t-test, 
for example). The coefficient associated with diabetes (5.59, 95 % 
CI 2.1, 9) means that diabetics have on average 5.59 units higher 
values of the response. Resistance indices tend to be higher in older 
subjects (0.18 per year of age) and lower in those with more preserved 
glomerular filtration rate (-0.07 per ml/min). 


Table 2 
Characteristics and conditions of validity of the general linear model 

Meaning of $\beta_k$ 

   The regression coefficients of the linear model are differences corresponding 
   to average changes in $y$ (output) per unit change in $x_k$ (input $k$) 

Gauss-Markov assumptions 

   1. Linearity: The shape of the systematic component (LP) must be reasonably linear; 
      mathematically LP is a sum of input(s) raised to the first power, each multiplied 
      by its parameter. This means that it must be possible to quantify the amount of 
      linear change of the output per unit change of the input(s) since the parameter 
      estimates are constant and apply over the whole range of the predictors 

   2. Normality: Residuals are normally distributed around the fitted line with 
      mean = zero and constant variance (homoscedasticity). This means that what 
      remains to be explained around the fit is similarly unknown across all input(s) 
      values 

   3. Independence: The residuals must be independent; this is possible only if the 
      observations are independent. This means that what remains to be explained 
      around the fit is unknown independent of the process of measurement 

Purposes of the linear model 

   1. To predict the value of an output for given input(s) based on estimates obtained 
      from a sample 

   2. To adjust the effects of an input variable on a quantitative output variable for the 
      effects of other extraneous variables (confounders) 

   3. To assess whether the effect of an input is unaffected or changes by level of 
      another input variable (interaction) 
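
The reading of linear coefficients as differences in the average response can be sketched in a few lines of Python. The sketch uses only the coefficients quoted in the text for the resistance index model; the published model may contain other predictors, and the function names and example subjects are hypothetical. For that reason only differences between subjects, not absolute predictions, should be read from it.

```python
# Coefficients quoted in the text for the resistance-index model [5].
# Assumption: no other predictors are considered in this illustration.
COEFFICIENTS = {
    "intercept": 50.8,
    "diabetes": 5.59,     # diabetic (1) vs. non-diabetic (0)
    "age_years": 0.18,    # per year of age
    "gfr_ml_min": -0.07,  # per ml/min of glomerular filtration rate
}

def linear_predictor(age_years, gfr_ml_min, diabetes):
    """Sum of each input multiplied by its coefficient, plus the intercept."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["diabetes"] * diabetes
        + COEFFICIENTS["age_years"] * age_years
        + COEFFICIENTS["gfr_ml_min"] * gfr_ml_min
    )

def expected_difference(subject_a, subject_b):
    """Difference in the expected response between two covariate patterns."""
    return linear_predictor(**subject_a) - linear_predictor(**subject_b)

if __name__ == "__main__":
    # Two hypothetical subjects who differ only in diabetes status.
    diabetic = {"age_years": 60, "gfr_ml_min": 40, "diabetes": 1}
    non_diabetic = {"age_years": 60, "gfr_ml_min": 40, "diabetes": 0}
    print(round(expected_difference(diabetic, non_diabetic), 2))
```

Whatever age and GFR the two subjects share, the difference equals the diabetes coefficient, because the other terms cancel.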

2.1.3 Model Check 

General linear models are considered the paradigm of all statistical 
models used in epidemiological research but they cannot be used 
in all circumstances. Graphical checks and formal statistical tests must 
be performed to check if the assumptions of the linear model are 
reasonably met. More details can be found in specific textbooks [6]. 


2.2 Logistic Model 
for Qualitative 
Responses 

When the response variable is a binary outcome that is either 
present or absent (sometimes termed “success” and “failure”), then an 
appropriate analysis is often binary logistic regression. The 
outcome might be the presence of a disease in a survey or the 
occurrence of an event in a prevention study, such as contrast nephropathy 
[2]. Researchers are interested in identifying factors associated 
with the event. However, since it is not possible to predict with 
certainty whether an event will occur, researchers instead look 
for factors associated with the probability (or risk) that an 
event happens. 

2.2.1 Structure 
of the Logistic Model 

Since the logistic function may seem quite complex, the 
characteristics of logistic regression may be more easily introduced by showing 
why linear regression cannot be used to model probabilities. 
Consider a sample of 100 subjects on whom the following variables 
have been measured: age (in years) and coronary heart disease 
(present or absent) [7]. Researchers may want to study disease status 
(y) as a function of age (x). Plotting disease status over age in years 
(Fig. 1, left plot) makes all observations fall on one of two possible 
values representing the absence of the disease (y = 0) or the 
presence of the disease (y = 1). This plot shows the binary nature of 
the response and suggests a possible association, as younger 
individuals tend to fall on the bottom line of no disease. However, the 
large variability of y at all ages (i.e., subjects with the same age with 
and without disease) does not allow appreciation of the relationship. 

Fig. 1 Study of the presence or absence of a disease as a function of age: The first scatter-plot uses the original 
measurements on 100 study subjects (left). The values of the response lie on two lines indicating presence 
(y=1) or absence (y=0) of the disease. However, subjects tend to be younger when y=0 (older when y=1). 
Since within the age values subjects may be on either line (high variability) the possible relationship cannot be 
easily appreciated. The second curve represents the proportion of subjects with the disease over age category 
mid-points (right). The relationship between the two variables is more clearly appreciated, but the curve is 
sigmoid rather than straight. With permission from Ravani et al. [39] 

Some variability can be removed by categorizing the input and 
plotting the mean output variable (proportion) value for each input 
level (Fig. 1, right plot). The relationship becomes appreciable, but 
the form of the ideal line passing through the points differs from 
what might have been obtained if the response was quantitative: 
first, the shape of the line is sigmoid rather than straight; second, 
there is no information about the errors. The logistic model rather than 
the linear model is appropriate for this type of relationship (see 
previous chapter). In fact in linear regression the conditional 
response may take any value as LP ranges from $-\infty$ to $+\infty$. 
Conversely, with dichotomous data the conditional mean 
(probability) must lie between 0 and 1. The right plot of Fig. 1 shows that this 
mean approaches 0 and 1 “gradually.” In other words, the S-shaped 
curve implies that the change in the expected value of the response 
per unit change in the predictor(s) becomes progressively smaller 
as the response gets closer to 0 or 1. This is what the logistic 
function does: transform a continuous variable with a range between 
$-\infty$ and $+\infty$ (LP) into a response ranging from 0 to 1 (probability). 
Everything else already introduced about the LP (concepts of 
linearity and meaning of the coefficients) is true also for logistic 
regression, although the epidemiological meaning of the 
parameters (coefficients of the LP) changes because of the specific function 
of the logistic model. The argument (LP) of the logistic function is 
an “index” combining the contributions of several risk factors, and 
the logistic function of LP represents the individual risk of the 
disease for a given value of LP. 

Another important difference with linear regression is the 
distribution of the errors. In linear models the observed values of the 
response (given the values of the individual inputs) are modeled as 
function of the LP and the error term is in the model 
($y \mid X = \mathrm{LP} + \varepsilon$). 
In logistic regression ($\operatorname{logit}[y \mid X] = \mathrm{LP}$) the observed values of the 
dependent variable (y failures in n trials) are not in the equation, 
but their expectations are linked to the model by the binomial 
distribution. Of course the values of the population parameter 
(probability) are unknown. The estimated or fitted values are used 
instead in the modeling process using Maximum Likelihood 
Estimation (see previous chapter). The remaining general 
principles of regression analysis are the same as in linear regression. 
Therefore, when the response is dichotomous: 

1. The conditional mean of the regression equation must be 
bounded between zero and one (and this is satisfied by the 
logistic formulation of LP). 

2. The binomial and not the normal distribution describes the 
distribution of the errors. 

3. The principles of analysis using linear regression also guide 
logistic regression. 

Table 3 
Example of simple logistic regression and meaning of the coefficients 

                          Failure      Success 
Exposed (E: x = 1)        30 (a)       20 (b) 
Unexposed (U: x = 0)      10 (c)       40 (d) 
Total                     40           60 

Pre-test odds of failure  = (a + c)/(b + d)  = 40/60     = 0.667 
Post-test odds among E    = a/b              = 30/20     = 1.500 
Post-test odds among U    = c/d              = 10/40     = 0.250 
Odds ratio                = (a × d)/(c × b)  = 1.5/0.25  = 6.000 

Before considering the predictor (x), the (null) model is 
$\operatorname{logit}(\pi_i) = \ln[\pi_i/(1 - \pi_i)]$, which is 
$\operatorname{logit}(\pi_i) = \beta_0 = \ln(40/60) = -0.405$. This gives the unconditional probability estimate 
$\pi_i = \exp(\beta_0)/[1 + \exp(\beta_0)] = 0.4$, or overall risk. However, the conditional probability changes when x is 
considered. The (full) logistic model is $\operatorname{logit}(\pi_i \mid x_i) = \beta_0^* + \beta_x x_i$, where $\beta_0^* = -1.386$ represents 
the log-odds among unexposed, and $\beta_x = 1.792$ is the log-odds ratio (log-OR). In fact, when x = 0, the odds are 
exp(-1.386) = 0.25; when x = 1 the odds are exp(-1.386 + 1.792) = 1.5. Therefore, $\beta_x$ represents the difference 
of the log-odds between exposed and unexposed, or log-OR. The exponentiated $\beta_x$ gives the OR of exposed vs. unexposed 
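
The quantities in Table 3 can be reproduced directly from the four cell counts. The following Python sketch is only an illustration under one stated assumption: with a single binary predictor the maximum likelihood estimates coincide with the empirical log-odds, so no iterative fitting is needed. Variable names are illustrative.

```python
import math

# Cell counts from Table 3: failures and successes by exposure.
a, b = 30, 20   # exposed:   failures, successes
c, d = 10, 40   # unexposed: failures, successes

# Null model: the intercept is the log of the pre-test odds of failure.
null_intercept = math.log((a + c) / (b + d))
overall_risk = math.exp(null_intercept) / (1.0 + math.exp(null_intercept))

# Full model: intercept = log-odds among unexposed,
# slope = difference in log-odds between exposed and unexposed.
beta_0 = math.log(c / d)
beta_x = math.log(a / b) - beta_0
odds_ratio = math.exp(beta_x)

if __name__ == "__main__":
    print("null intercept:", round(null_intercept, 3))
    print("overall risk:", round(overall_risk, 3))
    print("beta_0:", round(beta_0, 3), "beta_x:", round(beta_x, 3))
    print("odds ratio:", round(odds_ratio, 3))
```

The exponentiated slope equals the cross-product ratio (a × d)/(c × b) of the table.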


2.2.2 Meaning 
of the Coefficients 
in Logistic Regression 

The geometrical meaning of the intercept and the other 
coefficients of the LP in logistic regression is similar to that in linear 
regression: they define the position (where the risk equals 
0.5) and shape (how much the risk changes as the input values 
change) of the logistic function. However, differences in the 
argument of a logistic function have a specific epidemiological meaning 
since they are related to the odds ratio. This can be best shown 
using an example. Let us consider a study of a disease such as 
myocardial infarction and an exposure such as hypertension defined as 
present or absent. From the 2 by 2 table shown in Table 3 it can be 
seen that before studying the association between exposure and 
disease the only available data is that there are 40 failure events 
among 100 individuals (risk 0.4). Running a “null model” 
corresponds to ignoring the information related to the exposure: the 
intercept of the null model informs about the overall risk just as the 
intercept of a null linear model informs about the overall mean of 
the continuous response in the sample. The “logit” of the 
unconditional probability (log-odds) is the intercept of the null model 
since there is no input in the LP. The “exponentiated” intercept is 
the pre-test odds for disease. Using the logistic function (inverse) 
it is possible to estimate the overall risk. However, it may be 
hypothesized that the risk varies by level of exposure. This can be 
tested including the predictor in the model. The (full) model has 
a new intercept. This new intercept carries both the previous 
knowledge (pre-test odds) and the knowledge gained from 
considering the exposure “x” (likelihood ratio of being unexposed). 
Of course, the logistic model now also contains the coefficient of 
the input, which has the meaning of the log-odds ratio. In fact 
when the exposure is present the LP (logit of probability or 
log-odds) is the sum of $\beta_0$ and $\beta_x$. When the exposure is absent LP 
contains only $\beta_0$. The difference in these two log-odds, once 
exponentiated, is simply the odds ratio (OR) of exposed versus 
unexposed, in the example exp(1.792) = 6. When inputs are categorical or 
continuous the associated coefficient is the effect per unit change 
of the variable. For example, if the coefficient is 0.3, the OR 
associated with each level increase of input (category or unit of 
exposure) is exp(0.3) = 1.35, which means that each level change in the 
predictor is associated with a 35 % higher odds for disease (as 
compared to the previous level). If the coefficient has a negative sign, 
the predictor is associated with reduced odds for disease. For 
example, if the coefficient of serum albumin in g/L is -0.2, the OR 
is exp(-0.2) = 0.82, i.e., there is an 18 % reduction in the odds for 
the event per each g/L increase in serum albumin. If more inputs are 
in the model (including confounding and interaction terms), each odds 
ratio is adjusted for the effect of all the other independent variables 
in the model. This is clear from the little algebra shown above, as 
they will appear in both LP terms of the specific difference in 
question. Of course due to the exponential form of the model the 
individual odds change on a multiplicative scale as the values of the 
covariate change. For example, if the OR for a disease associated 
with male gender is two and the OR for the same disease is three 
in smokers, the OR for a male subject who smokes as compared to 
a non-smoking woman will be six. In linear models effects are 
additive instead because of the identity function linking the response to 
the LP. However, it is important to remember that the risk index 
(LP) changes linearly as the covariate levels change and that it is 
possible to interpret risk change (including the case of interaction) 
on an additive scale even using multiplicative models (see previous 
chapter). 
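
The step from a logistic coefficient to an odds ratio, and the multiplicative combination of odds ratios, can be checked numerically. The Python sketch below uses the coefficients mentioned in the text (0.3, -0.2, and odds ratios of two and three); the function names are illustrative and the combined odds ratio assumes no interaction term between the two inputs.

```python
import math

def odds_ratio(coefficient, change=1.0):
    """Odds ratio for a given change in the input: exp(coefficient * change)."""
    return math.exp(coefficient * change)

def percent_change_in_odds(coefficient, change=1.0):
    """Relative change in the odds, as a percentage (negative = reduction)."""
    return (odds_ratio(coefficient, change) - 1.0) * 100.0

# Additive on the log-odds (LP) scale is multiplicative on the odds scale:
# log(2) for male gender plus log(3) for smoking (no interaction assumed).
combined_or = math.exp(math.log(2.0) + math.log(3.0))

if __name__ == "__main__":
    print(round(odds_ratio(0.3), 2), round(percent_change_in_odds(0.3)))
    print(round(odds_ratio(-0.2), 2), round(percent_change_in_odds(-0.2)))
    print(round(combined_or, 2))
```

The same function gives the odds ratio for larger contrasts, for example `odds_ratio(-0.2, change=5)` for a 5 g/L difference in serum albumin.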

Using the full model the background risk can be estimated 
from the odds (O) among unexposed [R = O/(1 + O)]. However, risk 
estimation from a logistic model only makes sense in longitudinal 
studies. If subjects were selected based on disease (case-control 
design) the likelihood of exposure is assessed “retrospectively” and 
the estimate of any risk including the basal risk is biased. The same 
applies to relative risk (or ratios of estimated risks). For this reason 
in non-longitudinal designs logistic regression is used to estimate 
odds ratios only rather than risk and risk ratios. 
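
A short Python sketch of the conversion from odds to risk, using the odds from Table 3, may help to see why odds ratios and risk ratios are different quantities. It assumes the Table 3 counts come from a longitudinal design, since only then are the risks interpretable; names are illustrative.

```python
def risk_from_odds(odds):
    """Convert odds to risk: R = O / (1 + O)."""
    return odds / (1.0 + odds)

# Odds of failure from Table 3.
odds_unexposed = 10 / 40   # 0.25
odds_exposed = 30 / 20     # 1.5

risk_unexposed = risk_from_odds(odds_unexposed)  # background risk
risk_exposed = risk_from_odds(odds_exposed)

risk_ratio = risk_exposed / risk_unexposed
odds_ratio = odds_exposed / odds_unexposed

if __name__ == "__main__":
    print(risk_unexposed, risk_exposed)
    print(round(risk_ratio, 2), round(odds_ratio, 2))
```

With these counts the event is common, so the odds ratio is clearly larger than the risk ratio; the two are close only when the risks are small.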

2.2.3 Model Check 
and Other Issues 

There are a number of ways to check if the model fails to describe the 
data well. Some of these are based on the same principles as those 
applied to linear regression because the log-odds for the response 
must be linearly related to the predictors. Graphical assessment 
of the residuals is important to assess linearity and study outliers. 
In logistic regression (as well as in Poisson and in Cox’s regressions) 
the model likelihood plays the same role as the $R^2$ statistic in linear 
models: the higher the value the better the fit (usually the 
log-likelihoods of nested models are compared although it is possible to 
estimate a pseudo-$R^2$ also for nonlinear models). 

Specific to logistic regression are the goodness-of-fit test and 
the maximum number of parameters that can be simultaneously 
estimated. The goodness-of-fit test compares the observed 
probabilities with those predicted by the model using a chi-squared test. 
Evidence of lack of fit may indicate over-dispersion (extra-binomial 
variation), which means that the standard errors of the model may not 
be valid. This may arise because some important covariate has been 
omitted or because outcome data are correlated (e.g., repeated 
binary outcome in the same subject). 

The maximum number of model parameters is another 
important issue in logistic regression. Hosmer and Lemeshow indicate 
that a minimum of ten events (the lowest between successes and 
failures) per parameter is necessary to avoid biased variance 
estimation [8]. A detailed discussion of logistic regression including 
special topics and model extensions can be found in specific texts [9]. 
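
The rule of ten events per parameter translates into a simple calculation. The Python sketch below is a hypothetical helper, not a published algorithm; it takes the smaller of the two outcome counts and divides it by the required number of events per parameter.

```python
def max_parameters(n_failures, n_successes, events_per_parameter=10):
    """Largest number of parameters supported by the rule of thumb.

    The limiting count is the less frequent of the two outcomes.
    """
    limiting_events = min(n_failures, n_successes)
    return limiting_events // events_per_parameter

if __name__ == "__main__":
    # Counts from Table 3: 40 failures and 60 successes.
    print(max_parameters(40, 60))
```

With the Table 3 counts the limiting outcome is the 40 failures, so the rule would allow at most four parameters.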

2.3 Poisson Model 
for Counts 

Poisson regression is used when the risk of an event for an 
individual is constant and small but the number of individuals is large, 
and thus the total number of events is considerable. The outcome 
variable in Poisson regression is a count of independent events over 
a period of time at risk, such as the number of deaths over years of 
follow-up. This count is a discrete quantitative variable. The 
principal covariate in the model is the time at risk, which is recorded for 
each observation (exposure time). While logistic regression models 
probabilities, Poisson regression models rates ($\lambda$ = count of events/ 
number of times the event could have occurred). 

Rates and Rate Ratios: The risk of an event is the expected 
proportion of a group of people in whom the event occurs during a 
specified period of time. Risks are probabilities, dimensionless and with 
possible values ranging from 0 to 1, and can be estimated in 
short studies where subject follow-up is approximately complete. 
Longer studies estimate incidence rates instead of risks because 
when the study duration is long not all subjects are observed for 
the same amount of time (e.g., one may be observed for 10 years, 
another for 20 years, and so on). Rates have the same numerator 
as risks but person-time of observation as their denominator 
(e.g., one person observed for 10 years and another for 20 years 
would contribute a total of 30 person-years of follow-up). 
Therefore, rates 
treat one unit of time as equivalent to another, regardless of 
whether these time units come from the same subject or different 
individuals. Furthermore, incidence rates have the dimension of 
1/time and range from 0 to $+\infty$. Rates can exceed 1 (100 %) 
because they do not measure the proportion of the population 
that experience disease but the ratio of the number of events to 
the time at risk for disease. Since the denominator is measured in 
time units, the numerical value of the incidence rate depends on 
the chosen time unit. For example, if eight cases occur in 36 
subjects observed for 1 month, then the rate is 0.22 cases per 
person-month or 2.67 cases per person-year, but the two expressions 
measure the same rate. Finally, if the underlying risk is constant 
and small (e.g., less than 0.2) it can be estimated as the product 
of the rate estimate and the observation time. For example, if 
1,000 subjects are followed for 10 years and experience a 
mortality rate of 0.01 per person-year ($0.01\ \text{year}^{-1}$), the risk can be 
estimated as 0.01 × 10 or 0.1 over 10 years (each individual has a 
probability of 10 % of dying within 10 years). However, this calculation 
neglects the shrinking of the population as deaths occur over 
time, as the same mortality rate applies to a steadily smaller 
population at risk (exponential decay). The risk approximation of the 
incidence rate does not work well for high risk or very long time 
duration. Fortunately, risks of interest to epidemiologists are 
usually small and epidemiological studies not too long. For 
these reasons rate estimates approximate true risks reasonably 
well. For the same reasons, incidence rate (IR) ratios are 
interpretable as risk (R) ratios since 
$R_1/R_0 = (\mathrm{IR}_1 \times \text{time})/(\mathrm{IR}_0 \times \text{time}) = \mathrm{IR}_1/\mathrm{IR}_0$. 
An example will clarify why this is 
important to Poisson regression. Suppose that “d” independent events 
are observed during “n” person-years, where d is small as 
compared to n (note: person-years not persons). For example, in the 
Framingham Heart Study dataset there are 4,699 individuals and 
104,461 person-years of follow-up [10]. The observed incidence 
of coronary events is 
$\lambda_1 = d_1/n_1 = 823/42{,}688 = 0.019279\ \text{year}^{-1}$ 
or 19.28 per 1,000 person-years in men (1) and 
$\lambda_0 = d_0/n_0 = 650/61{,}773 = 0.010522\ \text{year}^{-1}$ 
or 10.52 per 1,000 person-years in women (0). The incidence rate 
ratio of men vs. women is $\mathrm{IRR} = \lambda_1/\lambda_0 = 1.832$. 
Poisson regression is used to 
estimate the IRR associated with “one unit change” of the 
predictor. 
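
The arithmetic of rates, the risk approximation and the incidence rate ratio can be reproduced in a few lines of Python. The counts are those quoted in the text; the function names are illustrative, and the "exact" risk assumes a constant rate, in which case the exponential decay mentioned above gives risk = 1 - exp(-rate × time).

```python
import math

def incidence_rate(events, person_time):
    """Events divided by person-time at risk (dimension: 1/time)."""
    return events / person_time

def risk_approximation(rate, time):
    """Risk approximated as rate x time (adequate when the risk is small)."""
    return rate * time

def risk_constant_rate(rate, time):
    """Risk under a constant rate, allowing for the shrinking population at risk."""
    return 1.0 - math.exp(-rate * time)

# Framingham counts quoted in the text.
rate_men = incidence_rate(823, 42688)
rate_women = incidence_rate(650, 61773)
irr = rate_men / rate_women

if __name__ == "__main__":
    # Same rate, two time units: per person-month and per person-year.
    per_month = incidence_rate(8, 36 * 1)
    print(round(per_month, 2), round(per_month * 12, 2))
    # Mortality rate of 0.01 per person-year over 10 years.
    print(round(risk_approximation(0.01, 10), 4), round(risk_constant_rate(0.01, 10), 4))
    print(round(rate_men * 1000, 2), round(rate_women * 1000, 2), round(irr, 3))
```

For the 10-year example the two risk calculations differ only in the third decimal place, which illustrates why the approximation is acceptable when risks are small.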

2.3.1 Structure 

of the Poisson Model 


Rates represent the number of events d expected to occur in n 
person-time (risk over time). The link function in Poisson 
regression is a simple log transform whose inverse is the exponential 
function. The Poisson model is 
$\log(\lambda \mid X) = \log(d/n \mid X) = \mathrm{LP}$ 
in terms of rates or $\log(d \mid X) = \log(n) + \mathrm{LP}$ in terms of counts. 
Using the inverse the expected number of events is 
$d \mid X = \exp\{\log(n) + \mathrm{LP}\}$. Since the expected count is assumed 
to rise in direct proportion to n, the coefficient for $\log(n)$ is fixed 
at one, and $\log(n)$ is known as the model offset. The important aspect 
to note here is that, as for logistic regression, 
the expected rather than the observed counts are modeled. 
The Poisson distribution links the observed counts to their 
expectations. Assuming that a Poisson process underlies the event of 
interest, Poisson regression finds ML estimates of the $\beta$ 
parameters. 

Let us imagine that ten subjects are observed during a 1-year 
study divided in one-month bands (unit of n); that the exposure 
times for the ten subjects (n) are: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 bands; 
and that three people die during the study period. Thus, d = 3 in 55 
1-month bands, each of 0.0833 years duration (1/12). The event 
rate is 3/(55 × 0.0833) = 0.655 per year or, in terms of count, 655 
deaths per 1,000 person-years. If the data had been updated every 
day, then the length of the band would have been 1 day and there 
would have been 55 × (365.25/12) = 1,674 bands (units of n) of 
0.002738 year duration. However, the rate would have been the 
same, i.e., 3/(1,674 × 0.002738) = 0.655. Thus, assuming all 
events occur at the end of the band (which is often a strong 
assumption), the estimates are not affected by the precision of the 
measurement. However, when individual data and more precise time 
measurements are available, time-to-event data rather than rates 
are more often modeled. 
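
The invariance of the rate to the length of the band can be verified with a small Python sketch of the ten-subject example. The function name is illustrative; the only inputs are the numbers given in the text.

```python
def rate_per_year(deaths, bands_at_risk, band_length_years):
    """Event rate = deaths / total time at risk, with time expressed in years."""
    return deaths / (bands_at_risk * band_length_years)

# Exposure of the ten subjects in one-month bands: 1 + 2 + ... + 10 = 55.
monthly_bands = sum(range(1, 11))
monthly_rate = rate_per_year(3, monthly_bands, 1 / 12)

# Same follow-up recorded in one-day bands.
daily_bands = monthly_bands * (365.25 / 12)
daily_rate = rate_per_year(3, daily_bands, 1 / 365.25)

if __name__ == "__main__":
    print(monthly_bands, round(daily_bands))
    print(round(monthly_rate, 3), round(daily_rate, 3))
```

Because the number of bands and the band length change in opposite directions, their product (the total time at risk) and therefore the rate stay the same.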

From the above example, rates can be viewed as probabilities 
over time. In fact $\lambda = d/n$, where n represents the number of bands 
times their duration h, and $\lambda$ can also be expressed as the expected 
event count per some multiple of unit time (e.g., $0.05\ \text{year}^{-1}$ = 5 
per 100 person-years). Poisson showed that the binomial 
distribution of the number of failures d converges to the Poisson 
distribution of the event count $\lambda$ as n increases provided that the expected 
number of failures $n\pi = d$ is constant. In fact when n is large 
($\pi = d/n$ is small) then $n\pi$ (expected number of failure events) 
approaches $\lambda$ (expected event count). This can be shown also in terms 
of the likelihood function of $\lambda$. In fact MLE is also the estimation 
method of Poisson regression. Further details can be found in specific 
textbooks [10, 11]. 
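
The convergence of the binomial to the Poisson distribution can be seen numerically. The following Python sketch holds the expected number of events fixed (here 3, an arbitrary value chosen for illustration) and lets n grow; the function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k failures in n trials, each with probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, expected_count):
    """Probability of k events when the expected count is fixed."""
    return math.exp(-expected_count) * expected_count**k / math.factorial(k)

def largest_gap(n, expected_count, k_max=10):
    """Largest absolute difference between the two distributions for k = 0..k_max."""
    p = expected_count / n   # n * p is held constant as n grows
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, expected_count))
        for k in range(k_max + 1)
    )

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, largest_gap(n, expected_count=3.0))
```

The printed gap shrinks as n increases, i.e., as the individual probability becomes small while the expected count stays the same.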

2.3.2 Meaning 
of the Coefficients 
in Poisson Regression 

In the Framingham example of 1,473 deaths in 104,461 
patient-years the overall event rate was 0.01410096 per person-year. 
Therefore, the yearly probability of death per person was 0.0141 
(expected count 141 per 10,000 person-years), indicating that the 
event measured in a large sample occurs with a small probability. 
This expected rate per person per unit time is the exponentiated 
coefficient of the null model $\log(\lambda) = \mathrm{LP} = \beta_0$. To estimate the 
effect of gender on mortality the covariate $x_i$ indicating male ($i = 1$) 
or female gender ($i = 0$) is introduced into the model. The other 
pieces of information needed for this model are $n_i$, the group 
exposure time; $d_i$, the number of deaths per group; and the true 
group probability of death $\pi_i$. The model will estimate the relative 
risk of death $\mathrm{RR} = \pi_1/\pi_0$ from the estimated $\pi_1$ and $\pi_0$ as 
explained in Table 4. 

Table 4 
Simple Poisson Regression using the Framingham data [10] 

Covariate (gender)   Failures       Exposure time (Pt-year)   Rate per unit time ($\lambda$) 
Male ($x_1$)         $d_1$ = 823    $n_1$ = 42,688            $\lambda_1 = d_1/n_1$ = 0.01928 
Female ($x_0$)       $d_0$ = 650    $n_0$ = 61,773            $\lambda_0 = d_0/n_0$ = 0.01052 

Derivation of the Poisson model: Since $\pi$ is relatively small, $\lambda_i$ approximates $\pi_i$. Then the expected number of deaths is 
$E(d_1 \mid x_1) = n_1 \pi_1$ in males and $E(d_0 \mid x_0) = n_0 \pi_0$ in females. Since the relative risk is $\mathrm{RR} = \pi_1/\pi_0$, then 
$\log[E(d_0 \mid x_0)] = \log[n_0] + \log[\pi_0]$ and $\log[E(d_1 \mid x_1)] = \log[n_1] + \log[\mathrm{RR}] + \log[\pi_0]$. Renaming $\log[\mathrm{RR}]$ with 
“$\beta$” and $\log(\pi_0)$ with “$\alpha$” we have the Poisson model: $\log[E(d_1 \mid x_1)] = \log[n_1] + \alpha + \beta$ and in more general form: 
$\log[E(d_i \mid x_i)] = \log[n_i] + \alpha + x_i \beta$. Therefore, the meaning of the coefficient $\beta$ is $\beta = \log[\mathrm{RR}] = \log[\pi_1] - \log[\pi_0]$ 
and RR is estimated as $\exp(\beta)$ 
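
The derivation in Table 4 can be checked numerically with the Framingham counts. The Python sketch below relies on one stated assumption: with a single binary covariate the model is saturated, so the maximum likelihood estimates equal the observed log rates and can be written in closed form. Names are illustrative.

```python
import math

# Framingham counts from Table 4: (deaths, person-years) by gender.
d_1, n_1 = 823, 42688   # male, x = 1
d_0, n_0 = 650, 61773   # female, x = 0

# Closed-form estimates for the one-covariate model (assumption stated above).
alpha = math.log(d_0 / n_0)          # log rate among women
beta = math.log(d_1 / n_1) - alpha   # log rate ratio, men vs. women

def expected_deaths(x, person_years):
    """exp{log(n) + alpha + x * beta}, with log(n) entering as the offset."""
    return math.exp(math.log(person_years) + alpha + x * beta)

if __name__ == "__main__":
    print(round(alpha, 3), round(beta, 3), round(math.exp(beta), 3))
    print(round(expected_deaths(1, n_1), 1), round(expected_deaths(0, n_0), 1))
```

The expected counts reproduce the observed deaths in each group, and the exponentiated coefficient reproduces the rate ratio of men versus women.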


2.3.3 Model Check 

There are a number of ways to check if the model fails to describe 
the data well based on graphical assessment of the residuals and 
formal tests. The goodness-of-fit test is important also for Poisson 
regression. Evidence of lack of fit may indicate over-dispersion 
(extra-Poisson variation), which means that the standard errors of the 
model may not be valid. This may arise because some important 
covariate has been omitted or because outcome data are correlated. 
The model likelihood has the same meaning as in logistic 
regression. Further details can be found in specific textbooks [10, 11]. 


3 Models for Time-to-Event Data 


3.1 Survival Data 

In many clinical studies, the main outcome under assessment is the 
time to an event of interest. This time is called survival time, 
although it may be applied to the time “survived” from complete 
remission to disease relapse or progression as well as to the time 
from diagnosis to death. For appropriate outcome measurement 
and analysis in survival studies it is especially important to define 
precisely the event and when the period of observation starts and 
finishes. For example, in studies of survival post-myocardial 
infarction, time is recorded from a starting point or “time zero” (the 
date of diagnosis of myocardial infarction), and the observation 
continues for each subject until either a recurrent fatal or nonfatal 
event occurs, the study ends, or further observation becomes 
impossible. 
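
How the observation period turns into an outcome record can be sketched in Python. The function below is a hypothetical helper with made-up dates, not a standard routine: it returns the follow-up time in days and whether the event was observed before the study ended or observation became impossible.

```python
from datetime import date

def follow_up(time_zero, last_seen, study_end, event_date=None):
    """Return (days of follow-up, event observed?) for one subject.

    Observation stops at the earliest of: the event, the last date the
    subject could be observed, and the end of the study.
    """
    stop = min(last_seen, study_end)
    if event_date is not None and event_date <= stop:
        return (event_date - time_zero).days, True
    return (stop - time_zero).days, False

if __name__ == "__main__":
    start = date(2020, 1, 1)          # hypothetical date of diagnosis
    end_of_study = date(2020, 12, 31)
    # Subject with a recurrent event on 1 March 2020.
    print(follow_up(start, date(2021, 1, 1), end_of_study, date(2020, 3, 1)))
    # Subject last observed on 1 July 2020 without an event.
    print(follow_up(start, date(2020, 7, 1), end_of_study))
```

Each subject therefore contributes two pieces of information, a time and an event indicator, which is the form in which survival data are usually stored.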


3.2 Key 
Requirements 
for Survival Analysis 

A critical aspect of survival analysis arises largely from the fact that 
some individuals have not had the event of interest at the end of 
the follow-up and their true time to event remains unknown. This 
phenomenon is called censoring and it may arise as follows: (a) a 
patient has not (yet) experienced the outcome event by the study 
close date; (b) a patient is lost to follow-up during the study period 
(e.g., due to transfer to another center or for consent withdrawal); 
(c) a patient experiences another (competing) event that makes 
further follow-up impossible (e.g., heart transplantation, a new 
health problem, or even a car accident). Censored observations are 
those who survived at least as long as they remained in the study 
but for whom the survival times are not known exactly. Such 
right-censored survival times underestimate the true (but unknown) 
time to event. Although some data can be censored in other ways, 
most survival data are right-censored. If the event occurred in all 
individuals, other methods of analysis would be applicable. 
However, the presence of censoring and the distribution of the 
failure times make survival analysis necessary to study time to event 
data [12, 13]. 

The analytical tool used to study survival data assumes that if 
censoring occurs it occurs randomly and is unrelated to the reason 
for failure (uninformative or independent censoring principle). In 
practical terms, this means that censoring must carry no prognostic 
information about the subsequent survival experience. Note that 
the assumption would be violated if, just prior to failure, subjects 
are highly likely to leave the study or if the dropout rate between 
groups is differential. Other key requirements for a valid survival 
study are a follow-up duration based on the disease severity and 
thus sufficient to provide enough power to the study (sufficient to 
capture enough events); homogeneous cohort effect on survival 
(similar survival probabilities for subjects recruited early and late in 
the study); and independence of the failure times (absence of 
correlation in the data) since estimates of the $\beta$ parameters are found 
by ML methods. 


3.3 Functions 
of Time-to-Event Data 

Survival data are generally described and modeled in terms of three 
related functions, namely, the survivor, the hazard, and the 
cumulative hazard functions. They are different functions of the LP 
meant to summarize the information on the outcome components 
described above (time zero, end date, and censor status) in one 
response variable (Table 5). The survival probability (cumulative 
survival probability or survivor function) is the probability (from 1 
at t = 0 to 0 as time goes to infinity) that an individual survives from 
“time zero” up to a specified future time t (observation end). 
Survival probabilities at different times provide essential summary 
information from time to event data. For example, a survivor 
function of 0.85 at 30 months informs that 85 % of the subjects 
(observed from t = 0) are still event free at 2.5 years (risk of 0.15 at 
2.5 years). The hazard is the instantaneous risk that an 
individual who is under observation at time t has an event at that 
time. Strictly it is a rate rather than a probability: a probability over 
a very small time interval, expressed per unit of time. Put another way, 
it gives the instantaneous potential for the event to occur, given that the 
subject has survived up to that instant (conditional rate). 

Table 5 
Functions used to model survival data 

Function   Symbol          Reading   Definition                      Range 
Sf         $S(t)$          S at t    Cumulative S probability at t   [0, 1] 
Hf         $\lambda(t)$    H at t    Conditional H rate at t         [0, $\infty$) 
CHf        $\Lambda(t)$    CH at t   Cumulative H at t               [0, $\infty$) 

S survival, t any point in time, f function, H hazard, CH cumulative hazard 

As opposed to the survival probability, which is dimensionless, the 
hazard has units of 1/time. Also, in contrast to the survivor 
function, which can only decrease over time, the hazard function can 
increase, decrease, remain constant or vary with different shapes. 
For example, at 2.5 years an individual may have a hazard of 0.003 
(e.g., events per person-month), but later on this might be higher 
or lower. This hazard is like a speed, with the risk of failure over 
time instead of distance covered over time, and may assume 
different values over time (from 0 to $+\infty$) independent of the average 
value calculated in an interval. There is a clearly defined relation¬ 
ship between survival and hazard functions, given by calculus for¬ 
mulae incorporated into most statistical packages and each can be 
determined automatically from the other. 

However, unlike the survival function, estimation of the hazard
is not simple. Another quantity, the cumulative hazard, is calculated
instead as an intermediary measure for estimating the
hazard. The cumulative hazard at t is the integral of the hazard (or
the area under the hazard function between times 0 and t) and is
also mathematically related to the other two functions. To understand
the concept it is useful to go back to the speed example. If a
person faced a hazard rate of death of 0.1 events per hour (a speed
of 0.1 mph), then the cumulative hazard is such that were that rate
to continue for 2 days (the speed constantly at 0.1 mph) we would
expect 4.8 failures to occur (4.8 miles traveled) in 2 days. Since an
integral is indeed just a sum, a cumulative hazard is not unlike the
total number of times the subject “would fail” over the interval
period (cumulative force of mortality).
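
The speed example can be checked numerically. The sketch below is only an illustration: it integrates a constant hazard of 0.1 events per hour over 48 hours and then applies the standard identity $S(t) = \exp(-\Lambda(t))$ linking the survivor and cumulative hazard functions. The function names and the trapezoidal integration are assumptions made for this example, not something the chapter prescribes.

```python
import math

def cumulative_hazard(hazard_fn, t_end, steps=10_000):
    """Approximate the area under hazard_fn between 0 and t_end (trapezoidal rule)."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        a, b = i * dt, (i + 1) * dt
        total += 0.5 * (hazard_fn(a) + hazard_fn(b)) * dt
    return total

def constant_hazard(t):
    """Hazard of 0.1 events per hour at every time t, as in the speed example."""
    return 0.1

# 2 days = 48 hours of follow-up at a constant rate
cum_haz = cumulative_hazard(constant_hazard, t_end=48.0)

# Survivor function implied by the cumulative hazard: S(t) = exp(-cumulative hazard)
survival = math.exp(-cum_haz)

print(round(cum_haz, 6))  # expected number of "failures" over the interval
print(survival)           # probability of being event free at 48 hours
```

Replacing `constant_hazard` with any other non-negative function of time gives the cumulative hazard and survival probability for a hazard that changes shape over follow-up.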

To compare hazards, survival functions or times across groups,
there are different approaches more or less free from specific distributional
assumptions (Table 6). Furthermore, some parametric
models have an accelerated failure time metric (log-time metric),
i.e., the estimated coefficients (the covariate effects) are interpretable
as log-time ratios and some have both the proportional hazard
and the log-time interpretation. The two interpretations are
different. The proportional hazard metric focuses on the actual risk
process (the hazard function) that causes failure and how the risk
changes with the value of the covariates in the model. The accelerated
failure time metric gives a more prominent role to time in the
analysis (how the survival time changes with the value of the covariates
in the model).

Table 6
Forms of survival analyses

Form             Parameters     Example               Metrics  Interpretation                     Exp($\beta$)
Nonparametric    None           Kaplan-Meier          NA       Survival probabilities can be      NA
                                                               compared across covariate levels
Semi-parametric  Effects        Cox’s model           PH       How the risk changes by            HR
                                                               covariate values
Parametric       Effects and    Gamma, log-normal     AFT      How t changes by covariate         TR
                 $\lambda(t)$                                  values
                                Exponential, Weibull  PH/AFT   Both PH and AFT                    HR or TR

Exp($\beta$) is the number e to the power of $\beta$, the estimated value of the coefficient; NA not applicable, PH proportional
hazards, HR hazard ratio; gamma, log-normal, exponential, and Weibull are the names of some parametric regression
models; AFT accelerated failure time, TR time ratio
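
The difference between the two metrics can be made concrete with a Weibull model, which admits both interpretations. The sketch below assumes the common parameterization $S(t \mid x) = \exp(-\lambda t^{p} e^{\beta x})$; the parameter values are hypothetical and chosen only for illustration. Under this assumption the hazard ratio is $\exp(\beta)$ and the time ratio is $\exp(-\beta/p)$.

```python
import math

def weibull_survival(t, x, lam, p, beta_ph):
    """S(t | x) for a Weibull model written in the proportional hazards metric."""
    return math.exp(-lam * t ** p * math.exp(beta_ph * x))

def weibull_median(x, lam, p, beta_ph):
    """Time at which S(t | x) equals 0.5 (median survival time)."""
    return (math.log(2) / (lam * math.exp(beta_ph * x))) ** (1.0 / p)

# Hypothetical parameter values (not taken from any study)
lam, p, beta_ph = 0.01, 1.5, 0.7

hazard_ratio = math.exp(beta_ph)     # PH metric: risk in exposed vs. unexposed
time_ratio = math.exp(-beta_ph / p)  # AFT metric: survival time in exposed vs. unexposed

# The ratio of median survival times reproduces the time ratio
median_ratio = weibull_median(1, lam, p, beta_ph) / weibull_median(0, lam, p, beta_ph)

print(hazard_ratio, time_ratio, median_ratio)
```

A hazard ratio above 1 corresponds here to a time ratio below 1: the same covariate effect reads as "higher risk" in one metric and "shorter survival time" in the other.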

Poisson regression can also be used to study survival times,
when individual information on exposure time and event occurrence
is available. However, Poisson regression models rates, which
are assumed to be low and constant (with variance equal to the
mean). When rates are not constant other approaches can be used.
For this reason Poisson regression is usually applied to model event
counts over a specified time interval (events per person-year) using
aggregated data and can be a good way to simplify complex survival
models [12, 13].
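
As an illustration of rates computed from aggregated data, the sketch below derives events per person-year and a rate ratio for two groups. The counts are invented for the example. With a single binary covariate and log person-time as offset, the exponentiated Poisson regression coefficient equals this crude rate ratio; the confidence interval uses the usual large-sample standard error of the log rate ratio.

```python
import math

# Hypothetical aggregated data: event counts and person-years by exposure group
groups = {
    "unexposed": {"events": 30, "person_years": 2000.0},
    "exposed": {"events": 45, "person_years": 1500.0},
}

def rate(group):
    """Incidence rate: events per person-year."""
    return group["events"] / group["person_years"]

rate_ratio = rate(groups["exposed"]) / rate(groups["unexposed"])

# Large-sample standard error of log(rate ratio): sqrt(1/a + 1/b)
se_log_rr = math.sqrt(1 / groups["exposed"]["events"] + 1 / groups["unexposed"]["events"])
ci_low = math.exp(math.log(rate_ratio) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(rate_ratio) + 1.96 * se_log_rr)

print(rate_ratio, ci_low, ci_high)
```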

3.4 The Cox’s Model

The Cox’s model is by far the most commonly used procedure in
current practice [14]. It is a semi-parametric model since it formulates
the analysis of survival data where no parametric form of the
hazard function (output) is specified and yet the effects of the
covariates (inputs) are parameterized (i.e., modeled based on
assumptions) to alter the baseline hazard function (the hazard for
which all covariates are equal to zero). The Cox’s model makes
estimation possible assuming that the covariates multiplicatively
shift the baseline hazard (Fig. 2).

Besides the ease of coefficient interpretation, freedom from
distributional assumption is the greatest advantage of Cox’s regression.
The cost is a loss of efficiency (precision) since the parameters
(coefficients) are estimated comparing subjects at the times when
failures happen to occur whereas parametric models maximize the
use of the information in the data.
Fig. 2 Hazards proportionality (hazards plotted against analysis time): The curve on the left shows proportionality of the
hazard ratio (HR) over time (e.g., the effect of the covariates is constant); the right
curve shows a HR varying over time. According to the Cox’s model, the hazard at
t, $\lambda(t)$ (continuous line), is a function of the baseline hazard $\lambda_0(t)$ (dotted line)
times the hazard ratio $\exp(\beta)$. It can be seen graphically that under the hazard
proportionality (HP) assumption (left curve) the distance between the two hazards
is constant over time, this distance corresponding to $\exp(\beta)$ (HR after log
transform). This is why estimation is made assuming that the covariates multiplicatively
shift the baseline hazard function (on the exponential scale). The proportionality
assumption can be verified based on these concepts. With permission
from Ravani et al. [39]

3.4.1 Structure of the Proportional Hazards Model

The individual hazard at time t is $\lambda(t \mid X) = \lambda_0(t)\,e^{LP}$, which means
that the hazard experienced at any time during follow-up depends
on a basal hazard and the exponentiated LP. No form of the baseline
hazard is specified, but there can be groups with different baseline
hazards ($\lambda_0$). For example, if the effect on survival of some categories
(gender or age levels, for example) is not of interest, a stratification
variable can be used to specify the model $\lambda_k(t) = \lambda_{0k}(t)\,e^{LP}$,
with k indicating the stratum. Note that “LP” remains the same.
The model, however, allows the basal risk to vary. Since in Cox’s
regression the intercept $\beta_0$ is in the $\lambda_0$ rather than in the LP, the
stratified Cox’s model can be thought of as a multiple regression
with intercept varying by stratum. In both cases the difference in
the LP between two groups of subjects (e.g., men and women) in
terms of hazard (e.g., cardiovascular event) does not involve time
and is constant (Fig. 2). In fact, the ratio $\lambda_k(t \mid X^*) / \lambda_k(t \mid X)$,
where * indicates male gender, is $\lambda_{0k}(t)\exp(LP^*) / \lambda_{0k}(t)\exp(LP)$,
which taking the logs is simply the difference $LP^* - LP$. Since this
difference does not involve time, it remains constant and constant
difference on a log scale implies proportionality on the natural scale.
This is the main assumption of the Cox’s model.
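
The cancellation of the baseline hazard can be verified numerically. In the sketch below the baseline hazard is an arbitrary function of time and the coefficients are hypothetical; both are assumptions made only for illustration. Whatever shape the baseline takes, the ratio of the hazards of two subjects depends only on the difference between their linear predictors.

```python
import math

def baseline_hazard(t):
    """Arbitrary baseline hazard; Cox's model leaves its form unspecified."""
    return 0.02 + 0.01 * math.sin(t / 3.0) ** 2

def hazard(t, lp):
    """Individual hazard: baseline hazard times the exponentiated linear predictor."""
    return baseline_hazard(t) * math.exp(lp)

# Hypothetical coefficients (log hazard ratios)
beta = {"male": 0.4, "age_per_decade": 0.3}

def linear_predictor(male, age_decades):
    return beta["male"] * male + beta["age_per_decade"] * age_decades

lp_man = linear_predictor(male=1, age_decades=6.0)
lp_woman = linear_predictor(male=0, age_decades=6.0)

# The hazard ratio is the same at every analysis time
ratios = [hazard(t, lp_man) / hazard(t, lp_woman) for t in (1, 5, 10, 20)]
print(ratios)
print(math.exp(lp_man - lp_woman))  # exp(LP* - LP), the constant hazard ratio
```
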
3.4.2 Meaning of the Coefficients in Cox’s Regression

As seen from previous models, differences in logs imply taking the
exponential to interpret the meaning of the coefficients. The exponentiated
coefficient represents the ratio of the hazards or hazard
ratio between two levels or units of exposure, $\exp(\beta) = \mathrm{HR}$.

Hazard ratios estimate the true risk ratios as the ratio of the
instantaneous probabilities of event (instantaneous event rates). In
fact, as previously discussed with rates, HR is an instantaneous RR,
the limiting value for the RR as time approaches zero. As time
approaches 0, the risks also approach 0. However, the value of HR
is different from zero and approaches that of the true RR. In survival
analysis, the incidence rate ratio is the limiting value for the
RR as time, over which the risks are taken, approaches 0.

3.4.3 Model Checks

The assumption about linearity of LP and the need to check for leverage
and influential observations are similar to those previously discussed.
The new important assumption to check is the hazard
proportionality. When effects change over time it is possible to
modify the formulation of the model to accommodate the problem.
Design issues related to independence of the observations are
very important as estimation is based on ML. Finally, the study protocol
must guarantee independent censoring and the other key
requirements for a valid survival analysis.


4 Extended Models 


Longitudinal studies typically monitor subjects over time and both 
exposure and outcome variables are often measured more than 
once in the same subject. Multiple outcome measurements performed
at regular intervals on the same subjects (longitudinal
data) are correlated because their values tend to be closer than
values obtained from different individuals. Also cross-sectional
measurements repeated in random order in the same individual or
the assessments of a paired organ such as the eye (repeated measures),
or single observations on different members of the same
hospital/region or family (clustered data) are correlated because
different organs of the same subject and different individuals of the
same community share biologic experiences, environmental exposures,
and genetic background, and therefore are not independent.
For this reason traditional regression methods are not appropriate 
for the analysis of correlated data (see previous chapter). 

The term “correlated data” refers to the association between 
different measurements of the response. The reader is already 
familiar with other types of correlation in the studied variables, 
such as the association between exposure and disease (which is the 
relationship of interest to the study), or multiple associations 
involving both inputs and output (determining phenomena such 
as confounding or multi-collinearity). Correlated outcome data 
arise whenever the unit of observation differs from the epidemiological
unit of the study and their analysis can be challenging.
However, correlated data are opportunities rather than problems if 
appropriately analyzed using methods developed to account for 
the within cluster correlation. In fact, the analysis of correlated 
data can help decrease the unexplained variability in the response, 
which is the ultimate goal of any regression analysis. 

Although criteria for model choice are the same as for traditional
regression models, the choice of the specific extended technique
and the corresponding data layout depend on the study
question and design. For both correlated generalized linear and 
time-to-event data two major analytical approaches are introduced 
here: random effect modeling and variance corrected methods. 
Other methods and special cases are mentioned at the end of the 
chapter. 


4.1 Extended 
Generalized Linear 
Models 

4.1.1 Panel Data Layout 


Data sets for the analysis of longitudinal data contain more than one
observation (i.e., record) per subject. In these data layouts there are
often multiple measurements of the response, as well as time independent
and time dependent inputs if data come from longitudinal
designs. Time independent inputs are variables that do not change
with time, such as gender, or variables measured at baseline, such
as starting body weight or initial blood pressure values. Time
dependent (or varying) covariates are input variables whose values
are updated during the study, such as blood pressure or hemoglobin
in studies of left ventricular mass. Follow-up or cross-sectional
studies may also generate correlated data, clustered in centers,
families, ethnic groups, etc. Correlated data generated by either
longitudinal or non-longitudinal designs are also referred to as
“panel data.”

Multilevel correlations can also be taken into account. For 
example, each unit of a paired organ can be assessed several times 
in the same subject and subjects may be grouped in larger clusters. 
Such panel data would include variables identifying the level
(hierarchy) of correlation.
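
The panel layout described above can be sketched as a "long" data set with one record per subject and occasion. All variable names and values below are hypothetical: `sex` and `baseline_weight` stand for time independent inputs, `sbp` for a time dependent covariate, and `lvm` for the repeatedly measured response.

```python
from collections import defaultdict

# Hypothetical long-format (panel) records: one row per subject and occasion
records = [
    {"id": 1, "occasion": 1, "sex": "F", "baseline_weight": 62.0, "sbp": 142, "lvm": 210.0},
    {"id": 1, "occasion": 2, "sex": "F", "baseline_weight": 62.0, "sbp": 135, "lvm": 204.0},
    {"id": 2, "occasion": 1, "sex": "M", "baseline_weight": 80.5, "sbp": 150, "lvm": 250.0},
    {"id": 2, "occasion": 2, "sex": "M", "baseline_weight": 80.5, "sbp": 158, "lvm": 259.0},
    {"id": 3, "occasion": 1, "sex": "M", "baseline_weight": 74.0, "sbp": 128, "lvm": 190.0},
    {"id": 3, "occasion": 2, "sex": "M", "baseline_weight": 74.0, "sbp": 131, "lvm": 193.0},
]

# Group the rows by the cluster identifier (here the subject)
panel = defaultdict(list)
for row in records:
    panel[row["id"]].append(row)

def is_time_independent(name):
    """True if the variable takes a single value within every subject."""
    return all(len({row[name] for row in rows}) == 1 for rows in panel.values())

print(is_time_independent("sex"))  # constant within subject
print(is_time_independent("sbp"))  # updated at each occasion
```

A further identifier column (e.g., a hypothetical `center`) would mark a higher level of the hierarchy in the same layout.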

A didactic data set is available from a famous reliability study of
Peak Expiratory Flow Rate [15], measured twice on 17 subjects
(Fig. 3). In this simple example there are only two rows (observations)
per subject (34 in total), both with the same value of subject
“identifier” (from 1 to 17); different values for the variable “occasion”
(1 or 2); and different values of the response (L/min).
Recorded measurements on the same subject tend to lie on the
same side of the overall mean and be closer to each other than those
taken on different individuals. In diagnostic studies the degree of
correlation is expected to be very high as between individual discrimination
is a necessary condition for a diagnostic test. However,
even small correlations in outcome studies may induce important
bias in the estimation process of both the model coefficients (effects)
and their standard errors (statistical testing and Confidence
Intervals) if not taken into account.

Fig. 3 Between (B) vs. within (W) subject correlation: Values of Peak Expiratory Flow Rate (L/min) measured
on two occasions in the same subject are correlated. In fact they tend to lie on the same side of the overall
mean and be closer to each other than those taken on different individuals. LP linear predictor (overall mean)

4.1.2 Modeling Random Effects

To understand the philosophy of this approach, it is useful to think
of different possible components of the overall variability of the
response. Regression methods minimize the residual variance by
assigning the most likely values to the model coefficients, which
estimate how the response varies as the inputs change. These are
called fixed effects associated with fixed factors or continuous
inputs whose levels of interest are actually measured or measurable.
Fixed effects are unknown constant population parameters describing
the input-output relationship of interest. Conversely, random
classification variables are inputs whose levels are “randomly sampled
from a population of levels” (e.g., individuals A, B, C; Drs A,
B, C; hospitals A, B, C, and so on). Random effects are unobserved
random changes of the response by levels of these random
factors. In other words, they are deviations from the relationship
described by fixed factors. To distinguish between random and
fixed factors, it is useful to answer the following question: “Were
the study repeated would the same groups/levels be used again?”
If yes (e.g., gender, treatment A vs B, age groups), it implies fixed
effects. If not (e.g., centers, regions, subjects), it implies random
effects. However, the same variable (e.g., centre) may be treated as
a random variable or as a fixed factor, depending on the objective
of the study (e.g., to assess a specific centre effect).

Fig. 4 Variance components: The top curve represents the distribution of a hypothetical response variable
(such as the Peak Expiratory Flow Rate, Fig. 3). The observed values are equal to the linear predictor (LP) plus
an error term (e). The latter includes two components: the variability due to the effect of subject “j” (random
effect, $\zeta_j$, equal to the difference between the mean $\mu_j$ of the values $y_{ij}$ measured on j and the LP); and the
variability due to measurement on occasion “i” (effect of occasion nested in subject, $\epsilon_{ij}$, equal to the difference
between $\mu_j$ and $y_{ij}$). Usually it is assumed that both these components are normally distributed with mean zero
and some non-zero variance (here indicated by $\psi$ and $\theta$). The intra-class correlation coefficient $\rho$ is the proportion
of the total variance explained by the random effect. With permission from Ravani et al. [40]


Figure 4 shows the main variance components of the data in
Fig. 3: the overall variance of the response includes a variability
component due to subjects and another component due to measurement.
This is true for both the unconditional response without
predictors and the conditional response given the inputs. The variability
due to subject is thought to be shared within individual but
to vary across them. This random effect is assumed to follow a
specified distribution (usually normal, with zero mean and some
non-zero variance). The variability due to measurement can be
estimated when more than one measurement is performed in the
same subject, although it exists independent of the number of
measurements performed. A random effect model estimates both
these variance components. When the variance of the random
effect is significantly different from zero, the null hypothesis of
absence of correlation is rejected. The ratio between the variance
of the random effect and the total variance is called intra-class correlation
coefficient and represents the proportion of the total variability
in the response due to subjects.
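
The two variance components and their ratio can be illustrated with a small balanced data set. The sketch below uses the classical one-way analysis of variance (method of moments) estimator rather than a fitted random effect model, which would normally be estimated by maximum likelihood; the measurements are hypothetical and are not the published Peak Expiratory Flow Rate data.

```python
def variance_components(data):
    """One-way ANOVA estimates for balanced data {subject: [y1, ..., yk]}.

    Returns (between-subject variance, within-subject variance).
    """
    n = len(data)                              # number of subjects
    k = len(next(iter(data.values())))         # measurements per subject
    grand_mean = sum(sum(ys) for ys in data.values()) / (n * k)
    means = {s: sum(ys) / k for s, ys in data.items()}
    # Mean square between subjects and mean square within subjects
    msb = k * sum((m - grand_mean) ** 2 for m in means.values()) / (n - 1)
    msw = sum((y - means[s]) ** 2 for s, ys in data.items() for y in ys) / (n * (k - 1))
    between = max((msb - msw) / k, 0.0)        # truncated at zero if negative
    return between, msw

# Hypothetical flow rates (L/min), two occasions per subject
flow = {1: [490, 500], 2: [400, 410], 3: [520, 510], 4: [430, 420], 5: [600, 610]}

between, within = variance_components(flow)
icc = between / (between + within)   # share of total variance due to subjects
print(between, within, icc)
```

In this invented example almost all of the variability lies between subjects, so the two measurements taken on the same subject are very highly correlated.
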
Random effect models are also called mixed effect models as
they usually estimate both fixed and random effects. The family of
generalized linear models depending on the specified link function
includes linear, logistic, and Poisson mixed models, for example.
All mixed models require specification of the level of correlation in
the data (variance-covariance structure) in addition to the link
function. The random effect can affect either or both the intercept
and the slope of the curve defining the input-output relationship.

4.1.3 Correcting the Model Variance

Another possible approach to the analysis of correlated data is
based on defining the correlation structure in the data and correcting
the model variance. For example, in a cross-sectional study
where three measurements are made in the same cluster it may be
reasonable to assume that any two responses within a cluster have
the same correlation. In this case as there is only one correlation
parameter $\rho$ the underlying correlation structure is referred to as
“exchangeable.” Conversely in the “unstructured” correlation
structure there are as many $\rho$ parameters as there are paired combinations
of n measurements, $n \times (n-1)/2$. In a longitudinal study it
may be assumed that the correlation depends on the interval of
time between responses, being greater for responses that occur 1
month apart rather than 20 months apart (Fig. 5).
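
These structures can be written down as small matrices. The sketch below builds an exchangeable matrix, a first-order autoregressive (AR1) matrix in which the correlation decays with the distance between occasions, and counts the parameters of an unstructured matrix. The function names and the value of $\rho$ are illustrative assumptions.

```python
def exchangeable(n, rho):
    """Same correlation rho between any two of n measurements in a cluster."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

def autoregressive(n, rho):
    """AR1: correlation rho raised to the absolute distance between occasions."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

def n_unstructured_parameters(n):
    """Number of distinct correlations among n measurements: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for row in exchangeable(3, 0.5):
    print(row)
for row in autoregressive(4, 0.5):
    print(row)
print(n_unstructured_parameters(4))
```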

Which structure best describes the relationship between correlations
is not always obvious although design issues may help decide.
Although the coefficient estimates depend on the correct model
choice (link function, covariates to include and their possible transformations)
and on a sufficiently large number of clusters (e.g.,
ideally greater than 40 and not less than 20), the choice of the correlation
structure is also important because the correlation matrix
enters in the estimation process of the variance of the coefficients.
Furthermore, the standard model variance estimators are consistent
(converge to the true variance value) only if the correlation
structure is correctly specified. For this reason a special variance
estimator is used to estimate the standard errors of the coefficients
when data are correlated. This method corrects the variance by
incorporating the dependencies in the computations, removing one
cluster at a time, and provides an honest estimate
for correlated data whenever the observations left out at any step
are independent of the observations left in. Standard errors are
usually larger than the corresponding naive standard errors,
depending on the sign of the correlation in the data (usually positive).
Put simply, the variance of the coefficient estimates is corrected
for the correlation in the data and the statistical testing is
more conservative (the confidence intervals are wider) as compared
to a standard procedure applied to the same data as though
each observation was independent. This empirical method is called
robust because the variance estimation is consistent, even if the
chosen correlation structure is incorrect (robust to misspecifications)
[16, 17].

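
The effect of correcting the variance can be seen in the simplest possible model, an overall mean (intercept only). The sketch below compares the naive variance of the mean with one common cluster-robust ("sandwich") form that sums the residuals within each cluster before squaring. It is an illustration under stated assumptions, without small-sample corrections, and is not necessarily the exact estimator implemented in any given package; the data are hypothetical.

```python
import math

# Hypothetical clustered data: the two values in each cluster are identical,
# i.e., the within-cluster correlation is as strong as it can be
clusters = {"A": [1.0, 1.0], "B": [2.0, 2.0], "C": [3.0, 3.0], "D": [4.0, 4.0]}

values = [y for ys in clusters.values() for y in ys]
n = len(values)
mean = sum(values) / n

# Naive variance of the mean: treats all n observations as independent
naive_var = sum((y - mean) ** 2 for y in values) / (n - 1) / n

# Cluster-robust variance: residuals are summed within cluster, then squared
robust_var = sum(sum(y - mean for y in ys) ** 2 for ys in clusters.values()) / n ** 2

naive_se = math.sqrt(naive_var)
robust_se = math.sqrt(robust_var)
print(naive_se, robust_se)
```

With positively correlated observations the robust standard error exceeds the naive one, which is the more conservative testing described in the text.
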
Fig. 5 Examples of correlation structures: Each panel represents a correlation matrix between any two of four
possible measurements in the same cluster (e.g., taken at time 1, 2, 3, 4). Each symmetric matrix has a value
of 1 along the main diagonal and some non-1 value off the diagonal. In the absence of correlation (independent
errors) the correlation structure is independent (identity matrix). The correlation structure is exchangeable if
there is only one parameter $\rho$ for any pair of measurements; unstructured if there are $n \times (n-1)/2$ different
parameters $\rho$; autoregressive (AR1) if there is only one $\rho$ parameter raised to a power of the absolute difference
between the times of the response; stationary m-dependent if the $\rho$ parameter is the same for k = 1, 2,
..., m occasions apart and zero for more than m occasions apart (here m = 2); fixed if specific values are
assigned to the $\rho$ parameters. With permission from Ravani et al. [40]


Generalized Estimating Equations are regression techniques
based on specification of the correlation structure, use of an empirical
(robust or corrected) variance estimator and freedom from distributional
assumption about possible effect of the correlation.
Also in this case one of the link functions of the generalized linear
family must be specified. In absence of correlation in the data
(independent correlation structure) Generalized Estimating
Equations coincide with the corresponding Generalized Linear
Model. In the presence of correlated data the variance-covariance
of the coefficients is estimated based on the working correlation
structure, the conditional mean, and a scale parameter to account
for the over-dispersion (extra-variability). If the chosen correlation
structure is correct there is no need for robust standard errors. The
robust method is a nonparametric estimate that does not assume
the correlation structure is correct [17, 18]. Another method to
estimate the standard errors without making distributional assumptions
is bootstrapping. In bootstrapping the sample is resampled
with replacement a certain (desired) number of times (to approximate
what would happen if the population were resampled). Then
model coefficients and measures of variance are estimated to obtain
a sample of estimates, from which empirical variance and standard
errors are obtained.

4.1.4 Model Choice

The choice of the analytical tool for correlated generalized linear
data can be guided by different considerations. As opposed to
Random Effects Models, Generalized Estimating Equations are
based on only one level of clustering, are not designed for inferences
about the covariance structure (the working correlation
structure is formulated with no distributional assumptions), and
do not give predicted values for each cluster. Using random factors
involves making extra assumptions, but gives more efficient estimates
and allows estimating contributions to variability from
different sources.

Generalized Estimating Equations are marginal models as they
assume a model holding over all clusters (population average).
Therefore, the coefficients represent average change in the response
over the entire population for a unit change in the predictor.
Random Effects Models are conditional models in that they assume
a model specific to each cluster (subject specific). Therefore, the
coefficients represent the average change in the response for each
cluster given a unit change in the predictor. Although population
effects can be derived averaging cluster effects, conditional models
are most useful when the objective is to make inferences about
individuals rather than the population.

4.2 Extended Survival Models

Correlation in the occurrence and timing of repeated events may
occur when individuals experiencing a single event belong to
groups or clusters, or where the subject experiences some event
more than once due to a recurrent event process [19]. The correlation
in the survival times may result from differences in the general
tendency to fail across individuals and varying tendency to fail further
once the recurrence process has started (Fig. 6). Heterogeneity
across subjects (unshared frailty) may be due to unknown, unmeasured,
or unmeasurable effects (different lifestyles, genetic traits,
environmental factors, and experiences), which influence the likelihood
to succumb to disease. As a result, some individuals are more
(and others less) prone to disease, experiencing their first, second,
third, etc., recurrent episode more (less) quickly than others. Event
dependence within a subject emerges when the threshold of further
events changes once previous events have occurred (e.g., the baseline
risk of thrombosis of the second and third bypass graft is
progressively higher or lower than that of the first). Further events
become more or less likely according to whether the process
induces a biological weakening or strengthening of the organism
and whether the subject is more or less frail (shared frailty). In
either case the risk for an event is a function of previous occurrences.
Medical research and clinical experience suggest that both
individual unshared tendencies and varying shared susceptibility to
fail during the recurrent process are likely to be the rule, rather
than the exception, in the study of multiple events and that each
may enhance the effect of the other [19].

Fig. 6 Sources of correlation within multiple failure-time data. Unknown (or unmeasured) factors can be
responsible for heterogeneity across individuals (with consequent different $\lambda_0(t)$, i.e., baseline risk across
subjects) and within subject dependence of the failure events ($\lambda_0(t)$ varying within the subject during the recurrent
process). With permission from Ravani et al. [40]

This correlation among events violates the assumption that the
timing of events is independent and has two important consequences:
the estimates of the coefficients and their standard errors
are both biased (wrong) and inefficient (imprecise) in typical
repeated events contexts. Variations of the Cox model (and other
models), namely, frailty or random effects models and variance-corrected
methods, have been proposed to account for the correlation
among event times.
4.2.1 Risk Sets for Survival Analysis

Data layouts for survival analysis are more complex as they define
the risk set based on the three components of the response variable
(time start, time stop, and censor status) and possible distinction of
basal risk categories. To define a risk set appropriately, failure events
should be first classified according to whether they have a natural
order and whether they are recurrences of the same or different
event type [19].

Unordered Events

Unordered events of the same type include lesions studied in paired
organs such as the eye [20]. For these data a Marginal Unstratified
Risk Set is appropriate. Marginal models measure each observation
time from subject enrollment. Events of different type include
diverse adverse reactions to therapy in an intervention trial, or uremia
and mortality in a follow-up study of chronic kidney disease
patients [3]. These events are unordered because they occur in
random sequence and, in the absence of correlated data and dependent
censoring, the Competing Risk Model of Lunn-McNeil has
been suggested for analysis [21]. In this case the likelihood of
being censored at time t does not depend on the reason for censoring
including failure from a competing risk. The competing risk
model is stratified by event type (basal risk allowed to differ) and
gives the same results as the combined end point analysis (time to
the first event that occurs). The number of observations per subject
is a multiple of the number of considered events, all censored
and of the same duration if no event occurred or all censored but
one if any event occurred. The advantage of the larger data set is
that it allows for easy estimation of within-event-type coefficients
(stratum specific effects) and the analysis does not take into
account any correlation, as each subject may have at most one
event [19]. When there are reasons to believe that the data are correlated,
it is possible to analyze multiple events per subject using
the Marginal Model of Wei-Lin-Weissfeld [22]. As in the previous
model all times are measured from the date of patients’ enrolment
(time zero) but each observation continues in each stratum beyond
the first event that occurred. An important characteristic of these
failure events is that each can occur only once per subject and that
all subjects are at risk for all events (if there are k possible events,
each subject will appear k times in the dataset, once for each possible
failure). This model may be appropriate when the predictors
under investigation are plausibly involved in the pathways leading
to more than one event type and, therefore, the censoring mechanism
for one event may be informative for the other. For example,
plasma levels of asymmetrical di-methyl-arginine (ADMA) have
been shown to predict both progression of chronic nephropathies
and death in patients with chronic kidney disease [3]. In these situations,
the terminating time for observing one event could be correlated
with the other (as it happened in that study) and, as a result,
the assumption of independent censoring may be violated.
Furthermore, considering only time to the first event that occurs
reduces the study power. Robust variance is required in this model
to account for the occurrence of multiple events in the same
subject.

Ordered Events

In end-stage renal disease, catheter infections or dysfunctions, 
repeated peritonitis or transplant rejection episodes are ordered 
events in that they may be seen in a study that records the time to 
first, second, third event, and so on, and the subject is not at risk 
for further events until a prior one has occurred. Four layout 
options are available for ordered recurrences. 

In the Counting Process each subject becomes a multi-event 
counting process since the total follow-up time of the subject is 
broken into event defined segments with as many records per indi¬ 
vidual as there are events plus one if the observation continues 
after the last event [23]. This model is not stratified, and estimates 
may be biased if the baseline risk changes during the recurrent 
event process [19]. 
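
As an illustration of the counting process layout described above, the following minimal Python sketch splits one subject's follow-up into event-defined segments. The function name, the column names (start, stop, event, stratum, gap), and the example times are assumptions made for illustration and do not come from any particular statistical package. The stratum and gap columns are not used by the counting process model itself; they carry the event number and the time since the previous event, which the conditional risk set models discussed below rely on.

```python
def counting_process_rows(subject_id, event_times, end_of_followup):
    """Return one record per event-defined segment of follow-up.

    A final censored record is added only if observation continues
    after the last event, so a subject contributes (events + 1) rows at most.
    """
    rows = []
    start = 0.0
    for number, stop in enumerate(sorted(event_times), start=1):
        rows.append({
            "id": subject_id,
            "start": start,        # elapsed time at the start of the segment
            "stop": stop,          # elapsed time at the event
            "event": 1,
            "stratum": number,     # event number (hypothetical column name)
            "gap": stop - start,   # time since the previous event
        })
        start = stop
    if end_of_followup > start:
        rows.append({
            "id": subject_id,
            "start": start,
            "stop": end_of_followup,
            "event": 0,            # censored segment after the last event
            "stratum": len(event_times) + 1,
            "gap": end_of_followup - start,
        })
    return rows


# Hypothetical subject: infections at months 5 and 12, followed for 20 months.
for row in counting_process_rows("A", [5.0, 12.0], 20.0):
    print(row)
```

Under these assumptions the subject contributes three records: two ending in an event and one censored segment from month 12 to month 20.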

In the Marginal Risk Set model the layout is identical to the 
unordered competing risk model [22]. In essence the model is 
stratified by event number but actually ignores the ordering of 
events (e.g., a person would be at risk for the fourth infection episode 
before the first event occurred) and just treats each failure 
occurrence as a separate process. In agreement with simulation 
studies the marginal formulation provides a larger estimated effect 
probably due to the lack of any order implication and the organiza¬ 
tion of the risk set [19]. However, this model may be useful to 
model the total time to each of the possible recurrent events, 
allowing basal risks to differ but with no strict order assumption. 

The assumption of the Conditional Risk Set model is that each 
patient is not at risk for a further event until a prior has occurred 
[24]. Two variations with different time scales and risk sets have 
been implemented and both stratify the data by event number so 
that the baseline hazard is allowed to vary with each event. In the 
conditional risk set model from entry (elapsed time) the data is set 
up as for the counting process (t measured from entry). This variation 
is useful when modeling the full time course of the recurrent 
event process. In the conditional risk set model from previous event 
(gap time) the clock is reset at each event (t from previous event 
with zero time at the beginning of each follow-up segment). This 
variation is useful to model the gap time between events. Both 
models are stratified by failure order to track the event number and 
the structure of the data set reflects this sequence or ordering 
assumption (conditional risk). However, elapsed time estimation 
produces the hazard of an event since the study began, while the 
gap time formulation gives the hazard since the previous event. 
The choice of gap versus elapsed time depends on the research 
question at hand. Using gap time presumes there are substantive 
reasons to believe that the “clock should restart” after each event 
in order to determine the effect of the covariates on subsequent 
events. In this case the estimated effects mirror how the covariates 
affect the risk of failure for each observation (e.g., risk of infection 
for each catheter). In contrast, elapsed time models assess the effect 
of the covariates on the risk of failure from the start of the study 
through the end. In such a case the estimated effects reflect how 
the covariates affect the risk of failure over the entire course of the 
recurrent event process (e.g., risk of recurrent infections). 

Time Dependent Effects and Time Varying Covariates 

Another issue to consider when defining a risk set is related to the 
values and effects of the input variables. The term time dependent 
is more appropriately used to define the effect associated with an 
input and the term time varying is used for a covariate with updated 
values over time. For example, an input variable measured at baseline 
can have different effects during different follow-up periods 
that can be modeled as a step-function of time. In a follow-up 
study, baseline values of renal function were associated with 
increased risk of death only during the first year of observation and 
not thereafter [3]. These estimated time dependent effects must 
satisfy the proportionality assumption when using the Cox model. 
Conversely a variable measured only once (at baseline) may interact 
with time and thus have an effect that changes with time, as was 
found for serum albumin in the HEMO study [25]. By definition, 
this effect will not satisfy the proportionality assumption. Another 
possibility is that the risk set contains updated values of a variable. 
For example, in a study of Urotensin II (a vasoactive substance) in 
chronic kidney disease patients, end-stage renal disease status (not 
yet on dialysis vs. already on dialysis) had a different effect on 
cardiovascular events [26]. This input variable was treated as a time 
varying covariate as subjects could change their status during 
follow-up. These input-specific effects must also satisfy the 
proportionality assumption. 

4.2.2 Variance Corrected Models 

Variations within the family of variance-corrected models are based 
on different definitions of the risk sets previously described including 
whether they allow for event-specific baseline hazards using stratifica¬ 
tion. In these models (Marginal, Counting Process, and Conditional 
Risk Sets) a robust (cluster) variance estimator is used, as previously 
described for extended generalized linear models, which incorporates 
the dependencies in the process of computations. 

Variance-corrected models represent one way to deal with the 
problems produced by heterogeneity across individuals and failure¬ 
time dependencies. However, since variance-corrected models do 
not incorporate any (random) effect into the estimates themselves 
(the effects are not adjusted for the heterogeneity), these may still 
remain biased. In other words, although providing corrected con¬ 
fidence intervals around the point estimates, these themselves may 
still be positively or negatively biased (Fig. 7). 




[Figure 7: forest plot marking [1] the point estimate and [2] the null effect; x-axis: regression coefficient scale, from -0.25 to 0.75] 

Fig. 7 Plot of the effect (β) of a covariate in the Cox model (forest plot). In this plot 
both relevance and precision of the estimated effect can be summarized [1], 
intervals (CIs) and null effect (0 on the β scale; 1 on HR scale): CIs including 0 
(HR=1) imply lack of significance due to greater variance relative to the effect 
(imprecision). Correlation in the data due to recurrent events may affect both 
chance error (variance) and systematic error (point estimate). Reducing bias (sys¬ 
tematic error) makes the point estimate closer to the true value in the population 
4.2.3 Frailty Models 

In contrast to the variance-corrected models, frailty models do 
incorporate heterogeneity into the estimated portion of the model 
by making assumptions about its distribution [27-32]. This latent 
random effect varies across individuals but it is assumed to be con¬ 
stant over time and shared by a single individual (or all members of 
a cluster). As a result, under frailty models the event times are 
assumed to be independent conditional on the patient’s underlying 
frailty and inference can be made in the standard fashion. Standard 
packages for frailty models estimate the variance of this random 
effect. When this variance is significantly different from zero, the 
model supports the hypothesis of a significant heterogeneity in the 
data based on the shared frailty. Frailty models have been shown to 
produce unbiased estimates of covariate effects in simulation stud¬ 
ies with known variance of the random effect (heterogeneity) and 
in absence of event dependence [19]. In these situations frailty 
models have been shown to perform better than variance corrected 
models. However, the baseline hazard rate for the standard frailty 
model is assumed to be the same across events (the traditional 
frailty model has the same risk set as the counting process). This 
has been viewed as a limitation in presence of event dependence, 
which is controlled instead by stratified variance-corrected meth¬ 
ods, and therefore, these may be preferred in presence of event 
dependence without heterogeneity. Since repeated events processes 
are usually characterized by both event dependence and heterogeneity 
(or it is often unclear which feature of the data mostly underlies 
the correlation), a stratified frailty model has been proposed 
with the same risk set as the gap time risk set [28]. This conditional 
frailty model combines estimation of the unobserved heterogene¬ 
ity incorporated as random effect with event-based stratification 
(varying baseline hazards) to control for event dependencies. 
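
To make the idea of a shared frailty concrete, the sketch below simulates two gap times per subject whose hazards are multiplied by a subject-specific gamma frailty with mean 1. It is a toy simulation under stated assumptions (exponential gap times, a hypothetical frailty variance of 0.5, no covariates, no event dependence), not a fitted frailty model; all names are illustrative. It only shows that an unobserved, time-constant random effect is enough to make event times within a subject correlated, and that the correlation vanishes when the frailty variance is zero.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_log_gap_times(n_subjects, frailty_variance, base_rate=1.0, seed=1):
    """Simulate log gap times to a first and a second event for each subject."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_subjects):
        if frailty_variance > 0:
            # Gamma frailty with shape 1/variance and scale variance (mean 1).
            z = rng.gammavariate(1.0 / frailty_variance, frailty_variance)
        else:
            z = 1.0  # no heterogeneity: every subject shares the baseline rate
        rate = z * base_rate
        # Conditional on the frailty, the two gap times are independent.
        first.append(math.log(rng.expovariate(rate)))
        second.append(math.log(rng.expovariate(rate)))
    return first, second


corr_frailty = pearson(*simulate_log_gap_times(10000, frailty_variance=0.5))
corr_none = pearson(*simulate_log_gap_times(10000, frailty_variance=0.0))
print("correlation with shared frailty:", round(corr_frailty, 3))
print("correlation without frailty:   ", round(corr_none, 3))
```

The gap times are independent given the frailty, yet marginally correlated once the frailty is unobserved, which is the dependence that a frailty model attributes to heterogeneity.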

4.2.4 Model Choice 

The choice of the analytical tool is dictated by the type and order 
of the failure events and the clinical question to be answered. 

For multiple events of different type the marginal model is 
often the best choice to avoid violation of the uninformative cen¬ 
soring condition. This is true when the model includes factors 
plausibly involved in the mechanism of more than one event type. 
Frailty models can be used to specify and account for the sources 
of correlation in the data [27-32]. 

For ordered recurrent events of the same type there are more 
choices, though most often the order condition and the difference 
in the baseline risks are important issues to be accounted for. The 
counting process is useful if there is no reason to believe that the 
baseline risk varies. The marginal risk model may be more appro¬ 
priate to model repeated hospitalizations (where the reason for 
hospitalization has no natural order) than repeated bypass graft 
thrombosis or peritonitis episodes. Conversely, when the clinical 
course of repeated events supports the conditional assumption, we 
can either model the entire time course of the disease (from entry) 
or model the time segments between failures (from previous 
event). However, variance corrected methods may still provide 
biased results in presence of heterogeneity since they do not incor¬ 
porate any random effect in the model. 

Heterogeneity and event dependence can be considered com¬ 
ponents of a latent random effect inducing biased estimates if not 
taken into account. Both sources of correlation in the data may 
simultaneously underlie most of the recurrent events processes, 
although one may prevail over the other. In presence of event 
dependence without heterogeneity the true variance of the frailty 
is zero. In these cases stratified variance-corrected methods per¬ 
form well, whereas the traditional (unstratified) frailty model 
detects the presence of a random effect that was probably the con¬ 
sequence of event dependence rather than heterogeneity. In pres¬ 
ence of heterogeneity without event dependence stratification may 
not be necessary since the baseline risk should not change by event 
number. In this case variance corrected models may be inefficient 
and the unconditional frailty model would perform better. Yet, 
since repeated events data are very likely to exhibit both sources of 
correlation, a modeling strategy that is robust to heterogeneity and 
event dependence may be necessary [28]. 
4.2.5 Competing Risks 

In the previous subheading (“Unordered Events”) we introduced two 
approaches for the analysis of unordered events of different type, 
including competing events [21, 22]. In these situations the occur¬ 
rence of one event precludes or alters the risk for other events. 
If the likelihood that an observation is censored at time t does not 
depend on the reasons for censoring including failure from a com¬ 
peting risk (non-informative censoring), standard survival analysis 
methods (Kaplan-Meier and Cox regression) and their extensions 
[21, 22] can be used in the usual fashion. In the presence of com¬ 
peting risks however, time to competing events and time to censor¬ 
ing may not be independent because the exposure of interest can 
be associated with one or more competing risks (informative cen¬ 
soring). In these situations an analysis based only on standard 
methods can lead to misleading conclusions. A more comprehen¬ 
sive analytical approach (including standard methods) will help 
examine the extent to which the existence of competing risks inter¬ 
feres with the relationship of interest. 

The reason the usual survival methods should be applied with 
caution in the presence of competing risks can be appreciated from 
the following example. Table 7 shows 12 subjects with kidney dis¬ 
ease followed until end-stage kidney disease (ESRD; N= 6) or 
death from other causes (N= 6). Obviously, at 12 months 50 % 
died and 50 % reached ESRD. However, at 12 months the risk 
(1 - Kaplan-Meier survival probability; 1-KM) of ESRD censor¬ 
ing for death is 78 % and the risk for death censoring for ESRD is 
100 %. The sum of these two risk estimates is >1. Therefore, in the 
presence of competing risks, a Kaplan-Meier risk estimate at t can 
be interpreted as the risk of an event assuming that other types of 
event do not exist or have not happened at t [33]. 

The cumulative incidence function (CIF) method has been 
developed for the analysis of survival data when there are compet¬ 
ing risks. According to this approach, the risk for any event is par¬ 
titioned into the risks for each type of recorded event. Table 7 
shows the key difference in the calculation methods. The cumulative 
risk (1 - KM probability) of ESRD at t = 5 (censoring for 
death), for example, is obtained by adding to the 1 - KM estimate 
at t = 4 (0.167) the product of the conditional risk for ESRD at 
t = 5 (1/8 = 0.125) times the cumulative ESRD-free survival at 
t = 4 (0.833). The CIF at t = 5 is the sum of the CIF at t = 4 (0.167) 
and the product of the conditional risk for ESRD at t = 5 
(1/8 = 0.125) times the cumulative probability of both-event-free 
survival at t = 4 (0.667). The fact that the conditional risk at t is 
multiplied by a cumulative survival probability at t - 1 that is never 
larger explains why the CIF at t is never larger than the corresponding 
1 - KM function when there are competing risks. This calculation 
method makes sure that the sum of all event-specific CIFs is always 
≥0 and ≤1 (first axiom of probability). Figure 8 shows the CIFs 
and 1-KM probability functions calculated in Table 7. 
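
The two calculations can be sketched in a few lines of Python. The event sequence below is the one listed in Table 7 (one event at each of 12 time points, no other censoring); the function name and the simplifying assumption of exactly one event per time point are ours, so this is an illustration of the arithmetic rather than a general-purpose estimator.

```python
# Event observed at t = 1, ..., 12 in Table 7 (12 subjects, one event per time).
EVENTS = ["ESRD", "ESRD", "DEATH", "DEATH", "ESRD", "DEATH",
          "DEATH", "ESRD", "ESRD", "DEATH", "ESRD", "DEATH"]


def one_minus_km_and_cif(events, cause):
    """Return (time, 1 - KM, CIF) for `cause`, one tuple per time point.

    Assumes exactly one event of some type at every time point and no
    censoring other than by the competing event, as in Table 7.
    """
    at_risk = len(events)
    km = 1.0       # survival free of `cause`, censoring the competing event
    km_any = 1.0   # survival free of any event
    cif = 0.0
    table = []
    for t, event in enumerate(events, start=1):
        risk = (1.0 if event == cause else 0.0) / at_risk
        cif += risk * km_any            # any-event-free survival at t - 1
        km *= 1.0 - risk
        km_any *= 1.0 - 1.0 / at_risk   # one subject fails at every time
        table.append((t, 1.0 - km, cif))
        at_risk -= 1
    return table


esrd = one_minus_km_and_cif(EVENTS, "ESRD")
death = one_minus_km_and_cif(EVENTS, "DEATH")
for (t, km_e, cif_e), (_, km_d, cif_d) in zip(esrd, death):
    print(f"t={t:2d}  ESRD: 1-KM={km_e:.3f} CIF={cif_e:.3f}   "
          f"death: 1-KM={km_d:.3f} CIF={cif_d:.3f}")
```

At t = 12 the two 1 - KM estimates are 0.781 and 1.000, which sum to more than 1, whereas the two CIFs are 0.500 each and sum to 1.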


Table 7 

Calculation of the cumulative incidence of events according to the Kaplan-Meier method (KM) and 
the cumulative incidence function (CIF). Bold emphasis indicates the event of interest in KM analysis 





Event of interest: ESRD 

R = risk of ESRD = ESRD/N; P = conditional survival = 1 - R; 
KM = cumulative survival = P x KM'; 
1 - KM = complement of Kaplan-Meier = R x KM' + (1 - KM)'; 
KM2 = any-event-free survival = KM2' x (1 - (ANY)/N); 
CIF = CIF of ESRD = R x KM2' + CIF' 
(a prime denotes the value at the preceding time) 

Time   N    Event   R       P       KM      1 - KM   KM2     CIF 
1      12   ESRD    0.083   0.917   0.917   0.083    0.917   0.083 
2      11   ESRD    0.091   0.909   0.833   0.167    0.833   0.167 
3      10   DEATH   0.000   1.000   0.833   0.167    0.750   0.167 
4      9    DEATH   0.000   1.000   0.833   0.167    0.667   0.167 
5      8    ESRD    0.125   0.875   0.729   0.271    0.583   0.250 
6      7    DEATH   0.000   1.000   0.729   0.271    0.500   0.250 
7      6    DEATH   0.000   1.000   0.729   0.271    0.417   0.250 
8      5    ESRD    0.200   0.800   0.583   0.417    0.333   0.333 
9      4    ESRD    0.250   0.750   0.438   0.563    0.250   0.417 
10     3    DEATH   0.000   1.000   0.438   0.563    0.167   0.417 
11     2    ESRD    0.500   0.500   0.219   0.781    0.083   0.500 
12     1    DEATH   0.000   1.000   0.219   0.781    0.000   0.500 

Event of interest: death 

R = risk of death = death/N; P = conditional survival = 1 - R; 
KM = cumulative survival = P x KM'; 
1 - KM = complement of Kaplan-Meier = R x KM' + (1 - KM)'; 
KM2 = any-event-free survival = KM2' x (1 - (ANY)/N); 
CIF = CIF of death = R x KM2' + CIF' 
(a prime denotes the value at the preceding time) 

Time   N    Event   R       P       KM      1 - KM   KM2     CIF 
1      12   ESRD    0.000   1.000   1.000   0.000    0.917   0.000 
2      11   ESRD    0.000   1.000   1.000   0.000    0.833   0.000 
3      10   DEATH   0.100   0.900   0.900   0.100    0.750   0.083 
4      9    DEATH   0.111   0.889   0.800   0.200    0.667   0.167 
5      8    ESRD    0.000   1.000   0.800   0.200    0.583   0.167 
6      7    DEATH   0.143   0.857   0.686   0.314    0.500   0.250 
7      6    DEATH   0.167   0.833   0.571   0.429    0.417   0.333 
8      5    ESRD    0.000   1.000   0.571   0.429    0.333   0.333 
9      4    ESRD    0.000   1.000   0.571   0.429    0.250   0.333 
10     3    DEATH   0.333   0.667   0.381   0.619    0.167   0.417 
11     2    ESRD    0.000   1.000   0.381   0.619    0.083   0.417 
12     1    DEATH   1.000   0.000   0.000   1.000    0.000   0.500 
Fig. 8 Cumulative incidence of ESRD (left) and death (right) according to the Kaplan-Meier method censoring 
for the competing risk (dashed line) and the cumulative incidence function accounting for the competing risk 
(solid line) 


While estimating risks ignoring competing risks (e.g., using 
1 - KM) is not recommended, because it implies strong assumptions 
in the presence of competing events, testing or estimating covariate 
effects using standard procedures (i.e., log-rank test or 
Cox regression) is useful. Pintilie suggests that survival 
analysis in the presence of competing risks has two main approaches: 
testing “pure effects” by ignoring competing risks and incorporat¬ 
ing competing risks in the estimation process [33]. Comparing 
1-KM (log-rank) or cause-specific hazards (Cox) gives insight 
into the biologic mechanism of the disease and is invariant to the 
size of the competing risks. Censoring for competing events 
assumes that subjects remain at the same risk following the occur¬ 
rence of a competing event. This may or may not be true in a given 
study. Conversely, comparing CIF or sub-hazards does not assume 
independence between event types, as subjects experiencing 
competing events are counted as if they did not have any chance of 
failing and observed risks or hazard rates are examined [33]. If the 
independent censoring assumption is not violated, then 
cause-specific and sub-hazards analyses will provide similar results. 
The following example illustrates how standard approaches and 
competing risk methods complement each other in the analysis of 
time to event when there are competing risks. 

Pintilie selected a group of 616 early stage Hodgkin lymphoma 
patients from a larger Canadian registry cohort [33]. Since all these 
patients received radiotherapy at a relatively young age (median 
age 30 years) follow-up is long and late effects of radiation have 
been recorded, including secondary malignancy and death from 
other causes unrelated to secondary malignancy. Since the risk of 
malignancy in general increases as a person grows older and radia¬ 
tion can cause malignancies, the study question is whether older 
age (age over 30) is associated with increased risk for malignancy 
secondary to radiotherapy as compared to the younger group 
(30 years or younger). During follow-up 84 subjects experienced 
secondary malignancy, 195 patients died from other causes and 
337 were alive at the study end date (Fig. 9). Since some patients 
experienced both secondary malignancy and death the model of 
Wei, Lin, and Weissfeld [22] would be useful to include multiple 
(correlated) failure times per subject (Subheading “Unordered 
Events”). However, since these patients were few, analysis based on 
the Lunn-McNeil model [21] provided similar results. Both these 
models are dual event models for cause-specific hazards, as the 
standard single event Cox model. 

Figure 10 clearly shows that older individuals die sooner than 
younger individuals as expected, irrespective of whether secondary 
malignancy is treated as competing risk or censored (right plots). 
Conversely, crude analyses of secondary malignancy differ accord¬ 
ing to whether death is treated as competing event or is censored. 
According to the competing risk analysis (top left) age at the time 
of radiotherapy is not associated with greater incidence of second¬ 
ary malignancy (Gray test P= 0.56), while censoring observations 
for death (bottom left) yields different 1 - KM functions (log-rank 
test P= 0.002). The two approaches convey complementary rather 
than conflicting information. Competing risk analysis suggests that 


[Figure 9: diagram of outcomes after radiotherapy (RX): censored, N = 337; secondary malignancy (SM), N = 84; death, N = 195] 


Fig. 9 Events recorded in a cohort of Hodgkin lymphoma patients following 
radiotherapy 
Fig. 10 Cumulative incidence functions (CIF; top plots) and 1 - Kaplan-Meier survival functions (1 - KM; bottom 
plots). The event of interest is secondary malignancy (left plots) or death (right plots). The exposure of 
interest is age >30 years (dashed lines) vs. 30 years or less (continuous lines) 


when death is taken into account secondary malignancy may not 
have a chance to be observed in the older subjects because these 
subjects are also more likely to die. However, analysis of 
cause-specific risks shows that if death did not occur then secondary 
malignancy also would be more likely in the older subjects 
(log-rank). Implications may differ according to the purpose of the 
study and the target stakeholders. From a health policy perspective 
different preventive measures may not be cost-effective after radia¬ 
tion in older subjects because they are more likely to die than to 
develop secondary malignancy. However, if they do not die they 
have higher risk of secondary malignancy, and this finding may 
inform clinical decision-making in a single patient. Table 8 summarizes 
the cause-specific hazard ratios (from single event Cox 
regression models and dual event models) and the sub-hazard 
ratios (from the Fine and Gray model for competing risk [34]) for 
secondary malignancy and death associated with older age versus 
younger age at the time of radiation in this Hodgkin lymphoma 
cohort. Consistent with the data depicted in Fig. 10, the confidence 
interval of the sub-hazard ratio for secondary malignancy includes 1, 
so the effect is not significant at the two-sided significance level of 0.05. 
Table 8 

Cause-specific and sub-hazard models of time to secondary malignancy and death 

Model                                HR     95 % CI      Outcome                                      Hazard           Correlation 
Cox (single event model)             0.51   0.33, 0.79   Secondary malignancy                         Cause-specific   No 
Fine and Gray (single event model)   0.88   0.58, 1.35   Secondary malignancy                         Sub-hazard       No 
Cox (single event model)             0.27   0.19, 0.38   Death without secondary malignancy           Cause-specific   No 
Fine and Gray (single event model)   0.28   0.31, 0.39   Death without secondary malignancy           Sub-hazard       No 
Lunn-McNeil (dual event model        0.51   0.33, 0.79   Secondary malignancy                         Cause-specific   No 
  disregarding events following      0.27   0.19, 0.37   Death without secondary malignancy           Cause-specific   No 
  the first event) 
Wei, Lin, and Weissfeld (dual        0.51   0.33, 0.79   Secondary malignancy                         Cause-specific   Yes 
  event model considering all        0.27   0.20, 0.36   Death with or without secondary malignancy   Cause-specific   Yes 
  events following the first event) 


In summary, while crude risk estimation in the presence of 
competing risks requires consideration of the cumulative incidence 
function, both analysis of cause-specific hazards and analysis of 
sub-hazards are useful to compare and interpret covariate effects. 
While cause-specific hazard analysis provides insights into the bio¬ 
logic mechanisms of disease, sub-hazard analysis compares 
observed risks. In the presence of competing risks the use of both 
these complementary approaches conveys complementary infor¬ 
mation and the opportunity to examine the existence of informa¬ 
tive censoring and interpretation of its potential consequences. 

4.3 Special Topics 

Random effect modeling and variance corrected methods are general 
approaches to model quantitative responses, categorical data, 
counts, and survival times. The advantage of these methods is that 
they are natural and very flexible extensions of standard techniques, 
easily applicable to different circumstances. Special methods exist 
for specific analytical issues. 

Repeated Measures ANOVA is an important model for 
continuous responses and categorical exposures. The method is 
based on partitioning between and within subject variance. For 
example, Dittrich et al. studied residual renal function changes 
over time in a group of peritoneal dialysis patients receiving 
contrast media and in a similar control group of unexposed subjects 
[35]. In this study renal function was assessed repeatedly over 
time, time was treated as a within-subject factor, and exposure to 
contrast media as a between-subject factor. The effect of the exposure 
and its interaction with time were studied taking into account 
intra-subject correlation of the data. However, ANOVA assumes 
that the variances of the differences between any two levels of the 
within-subject factor are the same (sphericity, or circular covariance 
matrix), as happens with constant variance and covariance of any pair 
of within-subject measures (compound symmetry or exchangeable 
correlation structure). This assumption is 
rarely met in practice, as adjacent observations tend to be more 
correlated than distant observations. Sphericity testing and adjust¬ 
ment methods for the underlying F statistics, specific randomized 
designs, and problems with missing data make this technique a less 
flexible approach than mixed models. 
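
The sphericity condition can be checked directly on a covariance matrix, because the variance of the difference between two repeated measures is Var(Yi) + Var(Yj) - 2 Cov(Yi, Yj). The sketch below, with hypothetical values (four measurement occasions, unit variance, correlation 0.5), compares a compound symmetry structure with a first-order autoregressive structure in which correlation decays with the distance between occasions. The function names are ours and the matrices are illustrative, not estimated from any study.

```python
from itertools import combinations


def difference_variances(cov):
    """Variance of Y_i - Y_j for every pair of repeated measures."""
    k = len(cov)
    return {(i, j): cov[i][i] + cov[j][j] - 2.0 * cov[i][j]
            for i, j in combinations(range(k), 2)}


def compound_symmetry(k, variance, rho):
    """Constant variance and constant covariance for every pair."""
    return [[variance if i == j else variance * rho for j in range(k)]
            for i in range(k)]


def ar1(k, variance, rho):
    """Correlation decays as rho ** |i - j| (adjacent measures most alike)."""
    return [[variance * rho ** abs(i - j) for j in range(k)]
            for i in range(k)]


cs_diffs = difference_variances(compound_symmetry(4, 1.0, 0.5))
ar_diffs = difference_variances(ar1(4, 1.0, 0.5))
print("compound symmetry:", cs_diffs)
print("autoregressive:   ", ar_diffs)
```

Under compound symmetry every pairwise difference has the same variance, so sphericity holds; under the autoregressive structure the variance grows with the distance between occasions, which is the pattern that violates the assumption.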

Multivariate ANOVA (MANOVA) is another approach to 
repeated measures of continuous data. This model requires that 
the number of subjects minus the number of between-subjects 
treatment levels be greater than the number of dependent variables 
(measurements). MANOVA allows studying the simultaneous 
change of several outcomes (repeated measures) in response to an 
exposure. For example, Van Vilsteren et al. showed beneficial 
effects of an exercise program for dialysis patients on behavioral 
change, physical fitness, physiological conditions and health-related 
quality of life [36]. 

In other longitudinal studies the study objective is the probability 
of a state change in a population. Markov chain models are 
used for series of system states, where the state change (transition) 
is studied assuming that a future state is conditionally independent 
of every prior state given the current state. For example, Weijnen 
et al. used a Markov chain model to investigate the impact of 
extended time on peritoneal dialysis using a new dialysis solution, 
assuming lower cost of the technique as compared to standard 
hemodialysis and longer durability as compared to peritoneal dialy¬ 
sis with traditional fluid [37]. Different scenarios were forecast 
over a 10-year period using aggregate data from the End-Stage 
Renal Registry in the Netherlands. 
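
The Markov property described above means that a cohort can be projected forward by repeatedly applying a matrix of transition probabilities to the current distribution of states. The following sketch does this for three states over ten yearly cycles. The states, the transition probabilities, and the starting cohort are entirely hypothetical values chosen for illustration; they are not taken from the cited study or from any registry.

```python
STATES = ["peritoneal dialysis", "hemodialysis", "death"]

# Hypothetical yearly transition probabilities; each row sums to 1.
# Row = current state, column = state one cycle later.
TRANSITIONS = [
    [0.70, 0.20, 0.10],   # from peritoneal dialysis
    [0.00, 0.88, 0.12],   # from hemodialysis
    [0.00, 0.00, 1.00],   # death is absorbing
]


def step(distribution, transitions):
    """One cycle: the next state depends only on the current state."""
    n = len(transitions)
    return [sum(distribution[i] * transitions[i][j] for i in range(n))
            for j in range(n)]


def project(distribution, transitions, cycles):
    """Return the state distribution at cycle 0, 1, ..., cycles."""
    history = [list(distribution)]
    for _ in range(cycles):
        distribution = step(distribution, transitions)
        history.append(distribution)
    return history


# Everyone starts on peritoneal dialysis (hypothetical cohort).
history = project([1.0, 0.0, 0.0], TRANSITIONS, cycles=10)
for year, dist in enumerate(history):
    shares = ", ".join(f"{s}: {p:.3f}" for s, p in zip(STATES, dist))
    print(f"year {year:2d}  {shares}")
```

Different scenarios correspond to different transition matrices, for example a lower probability of leaving peritoneal dialysis, with the projected distributions compared at the end of the time horizon.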

Time series are other models for longitudinal data. Time series 
study observations at successive (usually evenly spaced) time inter¬ 
vals to describe, quantify, and test theories on trends, seasonality, 
cyclic variation, and irregularities. For example, Espinosa et al. 
used time series analysis to study prevalence trends of Hepatitis C 
virus infection among hemodialysis patients in the Province of 
Cordoba, Spain [38]. 
Chapter 7 


Longitudinal Studies 4: Matching Strategies 
to Evaluate Risk 

Matthew T. James 

Abstract 

Matching is a strategy that can be used to control for confounding at the design stage of observational 
studies that examine exposure-outcome relationships. In case-control studies, matching can be used to 
generate subsamples of case and control units that are similar with respect to one or more confounders. 
In cohort studies, matching can balance confounder(s) so that they are the same in exposed and unexposed 
groups. Matching methods have been extended to include multivariable approaches, the most common 
being propensity score matching in observational studies of interventions. This chapter describes the major 
principles of matching applied to case-control, cohort, and propensity score studies. Matched study 
designs provide several advantages for controlling confounding in observational studies; however, they 
remain vulnerable to residual confounding and can even introduce bias when implemented incorrectly. 

Key words Matching, Case-control study, Cohort study, Confounding, Selection bias, Efficiency, 
Overmatching, Propensity score matching 


1 Introduction 


In epidemiologic studies, matching refers to the process of selecting 
a reference group that has one or more similar factors to an index 
group. Matching is most often employed at the level of individuals. 
In case-control studies, index cases (members of the group with an 
outcome of interest) are matched with reference controls (mem¬ 
bers of the group without the outcome of interest). Matching may 
also be used in cohort studies, where index subjects in an exposed 
group are matched with one or more unexposed participants. 
Matching may also be performed at the level of groups of subjects, 
a process known as frequency matching [ 1 ]. Frequency matching 
involves selection of a group of reference subjects matched on 
factors equivalent to the group of index subjects. Whether or not 
matching is of cases with controls, exposed with unexposed, or 
individual versus frequency, the process of matching seeks to create 
groups that are identical or reasonably similar in distribution based 
on one or more factors that potentially confound the relationship 
between an exposure and outcome. 

Matching can be considered a form of stratified sampling 
design, where the distribution of reference and index subjects 
across strata defined by the matching factor is forced to 
follow a specific distribution. In case-control studies, when controls 
are matched to cases, the distribution of cases specifies the 
distribution. In cohort studies, where the unexposed are matched 
to the exposed, it is the distribution of the exposed that determines 
the distribution. Matching is a common strategy in observational 
studies designed to determine risk, and represents the most com¬ 
mon form of stratified sampling in epidemiologic research. 
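
Viewed as stratified sampling, frequency matching amounts to drawing reference subjects so that their counts across strata follow the distribution of the index subjects. The sketch below illustrates this for controls matched to cases on an age group. The function name, the record fields, the 2:1 ratio, and the data are hypothetical; a real study would also need rules for strata with too few eligible controls, which this sketch simply reports as an error.

```python
import random
from collections import Counter, defaultdict


def frequency_match(index_subjects, pool, key, ratio=1, seed=0):
    """Sample `ratio` reference subjects per index subject within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in pool:
        by_stratum[key(person)].append(person)

    selected = []
    for stratum, n_index in Counter(key(s) for s in index_subjects).items():
        candidates = by_stratum.get(stratum, [])
        needed = ratio * n_index
        if len(candidates) < needed:
            raise ValueError(f"not enough reference subjects in stratum {stratum!r}")
        selected.extend(rng.sample(candidates, needed))
    return selected


def age_group(person):
    return person["age_group"]


# Hypothetical cases and a pool of potential controls.
cases = [{"id": i, "age_group": g}
         for i, g in enumerate(["<50", "<50", "50-69", "50-69", "50-69", "70+"])]
pool = [{"id": 100 + i, "age_group": g}
        for i, g in enumerate(["<50"] * 10 + ["50-69"] * 10 + ["70+"] * 10)]

controls = frequency_match(cases, pool, key=age_group, ratio=2)
print(Counter(age_group(c) for c in cases))
print(Counter(age_group(c) for c in controls))
```

The distribution of the cases determines the distribution of the controls: each stratum holds exactly twice as many controls as cases.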

Matching is often described as a strategy to reduce confound¬ 
ing in observational studies. This is true in cohort studies where 
matching alters the distribution of the matching factors in the 
study population from which all cases arise, such that the distribu¬ 
tions of these factors are made to be similar in exposed and unex¬ 
posed members of the cohort. Matching of unexposed to exposed 
thus serves to prevent any association between exposure and the 
matching factor and thereby controls for confounding in cohort 
studies. In contrast, although matching in case-control studies 
increases the efficiency of confounding control, failure to account 
for the matching factors in later stages of a case-control study can 
introduce bias. As explained further in this chapter, if matching 
factor(s) in a case-control study are confounders, it is important to 
account for the matching factor or factors in the analysis stage in 
order to avoid introducing selection bias. 


2 Matching in Case-Control Studies 

The purpose of matching in case-control studies is to select a control
group that will provide an estimate of the distribution of an
exposure of interest in the source population (Fig. 1a). Matching
selects controls that, with respect to the matching factor, are identical
or similar to the identified cases.


2.1 Potential for Selection Bias in Matched Case-Control Studies


In case-control studies, when the controls are matched with cases 
on a factor that is also associated with an exposure, the frequency 
of that exposure is forced to be similar to that of the cases. Thus if 
a matching factor is perfectly correlated with an exposure of inter¬ 
est, the distribution of this exposure in controls will be forced to be 
identical to that of the cases, and no association will be detected 
between the outcome and exposure (i.e., the result will be an odds 
ratio of 1.0). Regardless of whether the exposure is positively or 
negatively correlated with the matching factor, any association 
between the exposure and matching factor will result in an expo¬ 
sure distribution among controls that is biased towards a similar 
distribution to that of cases. 
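
A small worked example may make this concrete. The counts below are hypothetical expected values, not data from any study: a binary matching factor is strongly associated with exposure, the odds ratio is 2.0 within each stratum of the factor, and controls are either sampled from the source population or matched 1:1 to cases on the factor. The Mantel-Haenszel estimator is used here simply as one familiar way of stratifying on the matching factor.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio over stratum-specific 2x2 tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def collapse(tables):
    """Add the stratum-specific tables together, ignoring the matching factor."""
    return tuple(sum(column) for column in zip(*tables))

def control_exposure_prevalence(table):
    _, _, c, d = table
    return c / (c + d)

# Hypothetical counts per stratum of the matching factor:
# (exposed cases, unexposed cases, exposed controls, unexposed controls)
unmatched = [(800, 100, 288, 72), (100, 200, 168, 672)]   # controls follow the source population
matched = [(800, 100, 720, 180), (100, 200, 60, 240)]     # controls matched 1:1 to cases

print("exposure prevalence, cases:             ", 900 / 1200)
print("exposure prevalence, unmatched controls:", control_exposure_prevalence(collapse(unmatched)))
print("exposure prevalence, matched controls:  ", control_exposure_prevalence(collapse(matched)))

crude_matched = odds_ratio(*collapse(matched))
mh_matched = mantel_haenszel_or(matched)
print("crude OR in matched sample:", round(crude_matched, 3))
print("stratified (Mantel-Haenszel) OR:", round(mh_matched, 3))
```

In this sketch matching moves the exposure prevalence among controls from 0.38 to 0.65, much closer to the 0.75 seen in cases, and the crude odds ratio in the matched sample (about 1.6) falls below the stratum-specific value of 2.0, which the stratified estimate recovers.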

[Fig. 1 diagram. Labels: Source Population; Matching factor(s); EXPOSURE; OUTCOME. Panel (a): compare prior exposure(s) between matched groups. Second panel: compare subsequent outcome(s) between matched groups.]

Fig. 1 Use of matching in case-control (a) and cohort (b) studies. (a) Cases (who experienced an outcome) are
drawn from a source population, and controls who did not experience the outcome are selected from the
source population on the basis of similar matching factor(s). Prior exposures are compared between cases and
controls in the matched sample. (b) Exposed subjects are drawn from a source population, and unexposed
subjects are selected from the source population with similar matching factors. Matched exposed and unexposed
subjects are followed for subsequent outcomes.


This explains how the process of matching in case-control 
studies can introduce selection bias [2]. If the matching factors are
confounders in the source population, the matching process in a 
case-control study can superimpose a selection bias, in place of the 
confounding, that will bias results towards the null, regardless of 
the nature of confounding in the source population. For this reason, 
although matching is often intended to control for confounding, 
matching alone does not remove the potential for bias in a case-control
study. However, the selection bias introduced through matching in
case-control studies can be controlled for by appropriately accounting
for the matching factor as a confounder in subsequent steps of analysis.
This can be achieved through stratification on the matching factors, or
inclusion of the matching factor as an independent variable in
subsequent regression modeling. For example, if cases and controls have
been matched on sex and age categories, stratification or regression
adjustment for age and sex categories that are as fine as, or finer than,
the original matching criteria is required to remove the selection bias
introduced by matching.

2.2 Advantages of Matching in Case-Control Studies

Although matching in case-control studies does not by itself 
prevent confounding, one of its main advantages arises from the 
efficiency gained for the control of confounding [3]. For example,
when the distribution of confounders among cases differs substantially
from the distribution in the overall source population, there may be, in the
absence of matching, some strata with many cases and few controls 
and others with few cases and many controls. When controls are 
matched to cases, the ratio of controls to cases becomes constant 
across the strata of the matching factor. Matching forces the con¬ 
trols to have the same distribution of matching factors across strata 
as the cases and hence prevents extreme departures from the opti¬ 
mal distribution among controls. Matching thereby often improves 
study efficiency by minimizing the variance of subsequent estimates 
from stratified analysis or regression methods. Although matching 
on a factor that is a confounder will more often lead to an improve¬ 
ment in efficiency compared to unmatched studies, case-control 
matching on a variable that is not a confounder will usually harm 
efficiency. 

When the process of measuring an exposure or confounder is 
difficult or expensive, matching can serve the purpose of optimiz¬ 
ing the amount of information obtained for each subject included 
in the study [4]. This becomes particularly relevant when expo¬ 
sures under study include biological samples which are challenging 
to obtain or expensive to measure. In these cases it is more efficient 
to optimize the amount of information from each subject than to 
increase the number of subjects in the study. Matching on a con¬ 
founding factor can allow for control of confounding in the analy¬ 
sis, while making maximum use of exposure information that is 
difficult or expensive to obtain. 

In some situations matching is necessary to efficiently control
for confounding. This may occur when data are sparse, as with
nominal variables that have many categories (e.g., sibship, neighborhood,
or care provider) [1]. These variables are often characterized
by small numbers of potential subjects within each category of 
the nominal variable. While many subjects may be eligible for the 
study, subjects in any given category may have low probability of
being included in an unmatched sample, and most strata in a
stratified analysis may include only one subject, either a case or
control, which would supply no information about the effect under
study. Matching can ensure that after stratification by the potentially
confounding factor, each case has at least one matched control
for comparison. Matching on these types of nominal variables
is necessary to obtain an unconfounded and reasonably precise
estimate of effect.

2.3 Disadvantages of Matching in Case-Control Studies

There are costs that come with matching. When a factor is used for
matching, the true relationship between the factor and disease is
disrupted, and it is no longer possible to estimate the association of
that factor with the outcome. Although it is then no longer possible to
study the relationship of the matching factor to the outcome or to
evaluate its strength as a confounder, it is still possible to
evaluate the factor as an effect modifier.

Matching can also increase the expense required in the process 
of identifying appropriate control subjects with the same distribu¬ 
tion of matching factors found among cases [1, 4]. For example, 
many potential control subjects may need to be examined to identify 
one with the same set of matching factors as a case. Thus, if effi¬ 
ciency is judged by the amount of effort or cost required to obtain 
information for each subject included in a study, matching can 
decrease the efficiency of the study when the effort required to find 
matched subjects exceeds the effort that would be required to 
gather information on a large number of unmatched subjects [1]. 

Overmatching is another potential disadvantage of matching [5]. 
One form of overmatching refers to matching that reduces statisti¬ 
cal efficiency. This can occur when cases and controls are matched 
on a variable that is not a confounder. In this situation, matching 
was not necessary to control for confounding, but in order to 
avoid introducing selection bias, further stratification or regression 
adjustment becomes necessary in subsequent steps of analysis of 
the matched sample, resulting in a loss of information. The ineffi¬ 
ciency introduced in such a scenario is proportional to the strength 
of correlation between the matching factor and the exposure. 
To avoid this form of overmatching, matching is best done on a 
confounder that is strongly associated with the disease and has 
some degree of association with exposure. 

A second form of overmatching refers to matching on a variable 
that is an intermediary between exposure and disease. If matching 
is performed on a factor that is affected by the study exposure, the 
exposure prevalence among non-cases will be shifted toward that 
of the cases, thereby biasing the estimate of association toward the 
null. The bias introduced by this process will not be corrected by 
subsequent stratification or regression on the matching factor. 


3 Matching in Cohort Studies

The purpose of matching in cohort studies is to select an unexposed
group that, with respect to the matching factor(s), is identical
or similar to an exposed group (Fig. 1b). Matching in a cohort
study prevents an association between exposure and the matching 
factor among the study subjects at the start of follow-up, so the 
matching factor will no longer act as a confounder of the exposure- 
outcome relationship. Matching is less often performed in cohort 
studies than in case-control studies due to the expense of trying 
to identify suitable unexposed subjects in large cohorts. In cohort 
studies, it is often more efficient to follow an unmatched group 
of exposed and unexposed subjects, rather than to use resources to 
identify unexposed subjects with a similar distribution of matching 
factors to exposed subjects. 

Although matching in cohort studies can control for confounding
due to the matching factor(s), it does not necessarily
eliminate the need for control of the matching factors in the analysis.
This is because matching prevents an association between exposure
and the matching factors among participants at the start of a cohort
study, but it may not control for associations between exposure
and matching factors that arise during later follow-up time [1].
If an exposure and matching
factors affect competing risks and loss to follow-up, the original 
balance produced by matching will not be maintained across the 
person time available for analysis. In these cases, additional control 
of matching factors may be necessary to obtain valid estimates, 
despite the use of matching at cohort entry. 


4 Matching Using a Propensity Score 

Matching methods have been extended from matching on one or 
several covariates individually, to multivariate matching. Matching 
on multiple covariates between exposed and unexposed groups in 
cohort studies can be challenging because it becomes more difficult 
to find matches with close or identical distributions of all matching 
factors, as the number of matching factors increases. Rather than 
attempting to match on all of the covariates individually, discrimi¬ 
nant matching can be used to overcome this problem. Discriminant 
matching refers to matching on a scalar function of multiple covari¬ 
ates, which can be used to obtain overall balance of all of the cova¬ 
riates between exposed and unexposed groups. Propensity score 
methods, developed by Rosenbaum and Rubin, use this approach 
to control for multiple covariates simultaneously [6]. 

Fig. 2 Overlap of propensity score distributions among exposed and unexposed subjects in the full cohort before
matching (a) and in the propensity score-matched sample (b). In this example, within the full cohort (a) a larger
proportion of subjects in the unexposed group had low propensity scores, while a larger proportion of subjects
in the exposed group had high propensity scores. After matching on the propensity score, the distributions of the
propensity scores are similar between the exposed and unexposed groups.

A propensity score is an individual’s probability of being subjected
to an exposure of interest, conditional on a set of observed
baseline covariates [6]. A propensity score thus reduces a collection of
multiple covariates into a single scalar score that can subsequently
be used for matching. An important characteristic of a propensity
score arises from the fact that if exposed and unexposed groups
have the same distribution of a propensity score, they will also have
the same distribution of the observed covariates that were used to
generate this propensity score [6]. Thus rather than requiring close
or exact matches on all covariates, matching on a propensity score
enables the construction of matched sets of exposed and unexposed
groups with similar distributions of covariates (Fig. 2). Propensity
score matching is thus used to construct matched sets that are
balanced on a large number of covariates. 
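
In symbols, the definition and the balancing property described above can be written as follows. The notation is an assumption made here for illustration rather than the chapter's own: $Z$ denotes the exposure indicator (1 = exposed, 0 = unexposed) and $X$ the vector of observed baseline covariates.

$$
e(x) = \Pr(Z = 1 \mid X = x), \qquad X \perp Z \mid e(X)
$$

The second statement says that, among subjects with the same value of the propensity score $e(X)$, the distribution of the observed covariates $X$ does not depend on exposure status. It makes no claim about covariates that were not included in $X$.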

Propensity score matching is often used in cohort studies
where the exposure of interest is an intervention or treatment [7].
Such studies of an intervention are vulnerable to confounding due 
to treatment-selection bias, in which individuals who are selected 
to receive treatment may have significantly different characteristics 
than those who are not treated. These preexisting differences 
between treated and untreated groups need to be controlled to 
obtain unbiased estimates of the association between treatment 
and the outcome(s) of interest. 

4.1 Advantages of Propensity Score Matching


Matching on a propensity score selects exposed and unexposed
groups with a similar distribution of covariates that predict exposure.
The result is that the propensity score-matched subsample
of exposed and unexposed subjects is “balanced” with respect to
the observed covariate distributions, meaning that these distributions
are the same in both groups. This reduces bias due to measured
confounders. Furthermore, because the matching process selects
unexposed subjects who are most similar to the exposed subjects,
the subjects excluded from the matched cohort are those least
relevant for comparison with exposed individuals [6]. When there
are large differences in the covariate distributions between exposed
and unexposed groups, standard regression model-based adjustments
may be vulnerable to unreasonable extrapolation of model-based
assumptions. Matching methods make these differences
explicit and avoid sensitivity to untestable model assumptions [8, 9].

When selecting the exposed and unexposed comparison groups
by matching on the propensity score, no knowledge about subsequent
outcomes of the study is required. Only in subsequent steps
of analysis is the relationship between exposure and outcome examined
by estimating exposure effects within the matched sample. The
separation of these steps may prevent intentional or unintentional
bias that might otherwise occur when decisions about exposure
and covariate selection occur simultaneously with analysis of their
association with outcome [10].

Theoretical and empirical research has demonstrated that propensity
score matching has advantages over other methods of using
propensity scores, including stratification on a propensity
score or covariate adjustment using a propensity score. In prior
work, matching on a propensity score was shown to eliminate a
greater degree of the systematic differences between exposed and
unexposed subjects than did stratification on the propensity score
[11, 12]. Also, with propensity score matching, adequate balance
of covariates between the exposed and unexposed subjects in the
propensity matched sample can be more easily assessed than when
using methods of stratification or covariate adjustment using the
propensity score. A propensity score-matched study design also facilitates
calculation of measures of exposure-outcome relationships
such as absolute risk differences or the number needed to treat,
analogous to measures of effect reported in randomized trials [8].

4.2 Limitations of Propensity Score-Matched Studies

Although propensity score matching can produce a high degree 
of balance of covariates between exposed and unexposed groups, it 
may not achieve balance of variables that were not measured or 
included in the propensity score. Such unmeasured variables that 
are not part of the propensity score may remain unbalanced between 
cohorts, which may result in residual confounding due to these 
unbalanced variables and lead to biased estimation of exposure- 
outcome relationships. 

Although propensity score methods, including propensity 
score matching, have been proposed to address confounding by 
indication in observational studies examining treatment effects and 
have several theoretical benefits, there is little empirical evidence 
that they achieve better control of confounding than conventional
multivariable regression modeling of exposure-outcome relationships.
Although some work has suggested significant differences in treatment
effects when estimated using propensity score methods versus
traditional multivariable regression approaches, the majority of
published observational studies have reported similar results using 
both techniques to adjust for confounding [13-16]. It is not clear 
whether this is due to inherent properties of propensity score 
methods, or because studies did not implement propensity 
score methods properly [12]. 


5 Steps in Performing a Matched Study

The steps required to perform a propensity score-matched study
exemplify how matching can be applied to reduce the effects of
confounding in an observational study of an exposure-outcome
relationship [8].

5.1 Deriving a Propensity Score

In the first step of a propensity score-matched study, a propensity
score is derived for each subject in the cohort. The propensity
score is usually estimated using a logistic regression model, in
which the exposure of interest is the dependent variable, and several
baseline covariates are included as independent variables [17].
When the exposure of interest is a treatment, the propensity score
represents the estimated probability, conditional on each subject’s
covariates, of receiving the treatment.

5.2 Constructing the Propensity Score-Matched Sample

Next, the propensity score-matched sample is constructed. There 
are several methods for matching a randomly sorted cohort of 
exposed and unexposed subjects on the basis of a propensity score. 
In nearest neighbor matching, each treated subject is matched to the
untreated subject with the closest propensity score. When matching
is performed within a specified caliper width, a maximal allowable
difference in the propensity score between the exposed and
unexposed members of a pair is specified, and matches are selected
only if their difference in propensity score falls within this caliper.
Matching using a caliper width of 0.2 of the standard deviation of
the logit of the propensity score has been shown to result in good
balance [8]. Matching may also be done with or without replacement.
In matching with replacement, an unexposed subject remains
a possible match for more than one exposed subject, allowing some
individuals to be included in multiple matched pairs. When matching
without replacement, each unexposed and exposed individual
is matched only once. Greedy matching approaches match the next 
unexposed subject to an exposed subject, even if that unexposed 
subject would have been a better match for a subsequent exposed 
subject. Conversely, optimal matching algorithms select matches 
that will minimize the average difference in propensity scores 
across all possible matches. 
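
The combination of greedy nearest-neighbor matching, a caliper, and matching without replacement can be sketched as below. This is an illustrative sketch, not the procedure of any particular software package: the function name, the subject identifiers, and the propensity scores are hypothetical, the scores are assumed to have been estimated already, and the caliper is taken as 0.2 of the pooled standard deviation of the logit of the propensity score across the two groups (the pooling is an assumption of this sketch).

```python
import math
import random
import statistics

def logit(p):
    return math.log(p / (1.0 - p))

def greedy_caliper_match(exposed, unexposed, caliper_sd=0.2, seed=0):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    `exposed` and `unexposed` map subject id -> estimated propensity score
    (strictly between 0 and 1). Returns a list of (exposed_id, unexposed_id).
    """
    lp_exposed = {sid: logit(p) for sid, p in exposed.items()}
    lp_unexposed = {sid: logit(p) for sid, p in unexposed.items()}

    # Caliper expressed on the logit scale, using a pooled standard deviation.
    pooled_sd = math.sqrt(
        (statistics.variance(list(lp_exposed.values()))
         + statistics.variance(list(lp_unexposed.values()))) / 2
    )
    caliper = caliper_sd * pooled_sd

    # Process exposed subjects in random order, as in a randomly sorted cohort.
    order = list(lp_exposed)
    random.Random(seed).shuffle(order)

    available = dict(lp_unexposed)
    pairs = []
    for e_id in order:
        if not available:
            break
        # Greedy step: take the closest unexposed subject still available,
        # without checking whether a later exposed subject would fit it better.
        u_id = min(available, key=lambda u: abs(available[u] - lp_exposed[e_id]))
        if abs(available[u_id] - lp_exposed[e_id]) <= caliper:
            pairs.append((e_id, u_id))
            del available[u_id]   # without replacement: each control is used once
    return pairs

# Hypothetical propensity scores.
exposed = {"E1": 0.80, "E2": 0.60, "E3": 0.30, "E4": 0.95}
unexposed = {"U1": 0.78, "U2": 0.55, "U3": 0.25, "U4": 0.10, "U5": 0.05}

pairs = greedy_caliper_match(exposed, unexposed)
print(sorted(pairs))
```

In this toy data E4 has no unexposed subject within the caliper and is left unmatched, which illustrates how a caliper can reduce the size of the matched sample. An optimal matching algorithm would instead choose all pairs jointly and is not shown here.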

5.3 Assessing the Balance of Covariates in the Matched Sample


A comparison of the balance in the baseline covariates between
exposed and unexposed subjects in the matched sample is necessary
to determine whether the matching process has been successful.
Standardized differences are preferred over significance testing
because, unlike significance tests (and their corresponding P-values),
they are not influenced by the smaller size of the matched
sample compared to the overall cohort [8]. For continuous variables,
the degree of imbalance within the matched pairs can be
calculated as: 


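$$
d = 100 \times \frac{\bar{x}_{\text{exposed}} - \bar{x}_{\text{unexposed}}}{\sqrt{\dfrac{s^{2}_{\text{exposed}} + s^{2}_{\text{unexposed}}}{2}}}
$$

where $\bar{x}_{\text{exposed}}$ and $\bar{x}_{\text{unexposed}}$ are the sample means, and $s_{\text{exposed}}$ and
$s_{\text{unexposed}}$ are the sample standard deviations of the covariate in the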
exposed and unexposed groups, respectively. 
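
As a minimal sketch, the standardized difference can be computed with the Python standard library as follows. The function names and the example values are hypothetical. The second function applies the same form to a dichotomous covariate, using the fact that the variance of a proportion $p$ is $p(1 - p)$.

```python
import math
import statistics

def std_diff_continuous(exposed_values, unexposed_values):
    """Standardized difference (in percent) for a continuous covariate."""
    mean_diff = statistics.mean(exposed_values) - statistics.mean(unexposed_values)
    pooled_sd = math.sqrt(
        (statistics.variance(exposed_values) + statistics.variance(unexposed_values)) / 2
    )
    return 100 * mean_diff / pooled_sd

def std_diff_binary(p_exposed, p_unexposed):
    """Standardized difference (in percent) for a dichotomous covariate."""
    pooled_var = (p_exposed * (1 - p_exposed) + p_unexposed * (1 - p_unexposed)) / 2
    return 100 * (p_exposed - p_unexposed) / math.sqrt(pooled_var)

# Hypothetical matched sample: ages in years, and the proportion with diabetes.
age_exposed = [60, 62, 64, 66, 68]
age_unexposed = [59, 61, 63, 65, 67]

d_age = std_diff_continuous(age_exposed, age_unexposed)
d_diabetes = std_diff_binary(0.30, 0.25)
print(round(d_age, 1), round(d_diabetes, 1))
```

Because the difference is scaled by the pooled standard deviation rather than by a standard error, the result does not shrink or grow simply because the matched sample is smaller than the full cohort.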

Similarly, for dichotomous variables, the standardized difference
can be calculated as:

$$
d = 100 \times \frac{\hat{p}_{\text{exposed}} - \hat{p}_{\text{unexposed}}}{\sqrt{\dfrac{\hat{p}_{\text{exposed}}(1 - \hat{p}_{\text{exposed}}) + \hat{p}_{\text{unexposed}}(1 - \hat{p}_{\text{unexposed}})}{2}}}
$$

where $\hat{p}_{\text{exposed}}$ and $\hat{p}_{\text{unexposed}}$ represent the proportion of exposed and
unexposed subjects, respectively, who have the characteristic.

Standardized differences of more than 10 % are generally used
as a threshold to suggest imbalance between the groups. The derivation
of the propensity score, selection of the matched sample,
and assessment of balance between the treated and untreated can
be an iterative process, in which earlier steps are modified and
repeated until an acceptable balance in covariates is achieved
between the exposed and unexposed groups.

5.4 Estimating the Association Between Exposure and Outcome(s)

In the final step, the association between the exposure and 
outcome(s) of interest is evaluated within the cohort of matched 
pairs. Because matched pairs are not independent observations, the 
subsequent analytical strategy to assess the exposure-outcome 
relationship should account for the correlation within the matched pairs.
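
One simple way to respect the pairing, for a binary outcome, is to base the analysis on the discordant pairs. The sketch below is an assumption made for illustration, since the chapter does not prescribe a specific method: it computes the paired risk difference and McNemar's chi-square statistic (1 degree of freedom) from hypothetical counts.

```python
def matched_pair_summary(pairs):
    """Summarize a binary outcome in 1:1 matched pairs.

    `pairs` is a list of (exposed_had_outcome, unexposed_had_outcome) booleans,
    one tuple per matched pair.
    """
    exposed_only = sum(1 for e, u in pairs if e and not u)
    unexposed_only = sum(1 for e, u in pairs if u and not e)
    n_pairs = len(pairs)
    discordant = exposed_only + unexposed_only

    # Pairs in which both or neither member had the outcome cancel out
    # of the difference in risk between exposed and unexposed subjects.
    risk_difference = (exposed_only - unexposed_only) / n_pairs
    mcnemar_chi2 = (exposed_only - unexposed_only) ** 2 / discordant if discordant else 0.0
    return {
        "exposed_only": exposed_only,
        "unexposed_only": unexposed_only,
        "risk_difference": risk_difference,
        "mcnemar_chi2": mcnemar_chi2,
    }

# Hypothetical matched cohort of 100 pairs.
pairs = ([(True, True)] * 10 + [(True, False)] * 25
         + [(False, True)] * 10 + [(False, False)] * 55)

summary = matched_pair_summary(pairs)
print(summary)
```

Other analyses that account for matching, such as models stratified on the matched pair, follow the same principle of comparing subjects within pairs rather than treating all observations as independent.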


Chapter 8


Longitudinal Studies 5: Development of Risk 
Prediction Models for Patients with Chronic Disease 

Navdeep Tangri and Claudio Rigatto 

Abstract 

Chronic diseases are now the major cause of ill health in both developed and developing countries. Chronic 
diseases evolve, over decades, from an early reversible phase, to a late stage of irreversible organ damage. 
Importantly, the trajectory of individual patients with a chronic disease is highly variable. This uncertainty 
causes substantial stress and difficulty for patients, care providers and health systems. Clinical risk predic¬ 
tion models address this uncertainty by incorporating multiple variables to more precisely estimate the risk 
of adverse events for an individual patient. In the current chapter, we describe the general approach to 
developing a risk prediction model. We then illustrate how these methods were applied in the development 
and validation of the Kidney Failure Risk Equation (KFRE), which accurately predicts the risk of kidney 
failure in patients with Chronic Kidney Disease Stages 3-5. 

Key words Chronic disease, Prognosis, Risk prediction models 


1 Introduction 


Chronic diseases are now the major cause of ill health in both 
developed and developing countries. Unlike acute illnesses, chronic 
diseases evolve over decades, from an early, preclinical, and reversible 
phase, to a late stage characterized by irreversible organ damage. 
Importantly, the trajectory of individual patients with a chronic 
disease is variable and difficult to predict. This prognostic uncer¬ 
tainty causes many difficulties for patients, care providers and 
health systems. Consider, for example, the specific case of chronic 
kidney disease. 

Chronic kidney disease (CKD) is defined as the presence of 
persistent reduction in kidney function (i.e., glomerular filtration 
rate (GFR) <60 mL/min for more than 3 months) or evidence 
of chronic kidney damage (e.g., proteinuria), and afflicts 10-13 % 
of adults worldwide [1-5]. The major causes of CKD are hyper¬ 
tension, diabetes, atherosclerotic vascular disease, and certain 
glomerular diseases (e.g., IgA nephropathy). Even within these 
categories, however, there is tremendous variation in rates of
progression to kidney failure (i.e., needing dialysis) [6-8]: some 
patients progress rapidly to kidney failure, whereas others remain 
stable indefinitely with minor reduction in kidney function. In
addition, since patients with CKD often
suffer from multiple comorbid conditions, they are at risk of developing
competing outcomes, such as cardiovascular disease and death.
Indeed, many patients will die of cardiovascular disease before their 
kidney disease progresses to failure. Although the risk of all three 
adverse events (i.e., kidney failure, cardiovascular disease, death) is 
high, the risk for each individual event in a given patient is difficult 
to predict. 

This variability is a significant problem for several reasons. For 
patients, uncertainty hampers psychological adaptation to illness, 
and degrades quality of life [9-12]. Patients need to know: what 
will happen to my kidneys? Will I need dialysis? If so, how soon? 
For health professionals, lack of accurate prognostic estimates 
makes it difficult to appropriately counsel CKD patients, plan fre¬ 
quency of follow-up, and determine optimal timing for invasive 
procedures required in preparation for dialysis, such as arteriove¬ 
nous fistula creation, or referral for preemptive transplantation. 
From the health systems perspective, CKD care is expensive, and 
requires specialized resources and frequent visits. These resources 
should be directed to patients at high risk, and not those at mini¬ 
mal risk of adverse outcomes. 

Clinical risk prediction models address this problem by incor¬ 
porating multiple variables to more precisely estimate the risk of 
adverse events for an individual patient [13]. Over the last three 
decades, there has been an increase in the use of risk prediction 
models with integration into multiple aspects of medical care [14]. 
In particular, instruments to predict cardiovascular risk have revo¬ 
lutionized the care of patients with cardiovascular disease, and pro¬ 
vided novel insights into the prognostic role of individual risk 
factors and the efficacy of medical interventions [15, 16]. More 
recently, a robust tool to prognosticate risk of kidney failure has 
been developed and validated [17]. 

In the current chapter, we describe the approach to developing a 
risk prediction model in general terms. We then illustrate how these 
methods were applied in the development and validation of the 
Kidney Failure Risk Equation (KFRE), which accurately predicts 
the risk of kidney failure in patients with CKD Stages 3-5 [17]. 


2 Methods of Model Development 

In order to be clinically useful, a prediction model must be inter¬ 
nally valid, show improved reclassification of patients at risk, be 
externally valid in independent cohorts, and be easily applicable at 
the bedside. 
2.1 Internal Validity: Getting the Basics Right

Internal validity refers to the concept that a prediction model must
be derived from the study sample in such a way that the model
coefficients accurately reflect the true relationships between the
predictor variables and the outcome of interest. Internal validity
therefore requires that the prediction model be derived from an
appropriately structured and assembled cohort. The cohorts chosen
for model development (and later validation) are most commonly
derived from published prospective cohorts or RCT cohorts, or
assembled from administrative databases using specific case definitions.
Although a new prospective cohort study could be created
for this purpose, such a strategy would offer minimal advantages
unless the purpose was to incorporate novel tests or biomarkers
into a prediction model; in such cases, suitable stored samples are
often not available from an existing cohort, justifying the time and
resources required to mount a prospective study. As with any valid
prognostic study, the cohort must be structured so that predictor
variables of interest are assessed in all patients at a point in time
well before the development of the outcome. The proposed predictor
variables ideally must have face validity, be clearly defined,
and be precisely measurable. To avoid ascertainment bias, the outcome
must be defined unambiguously, and assessed equally in all
patients by adjudicators blinded to the predictor variable status of
patients in the study [13].

Appropriate statistical approaches to model building are also of
importance for internal validity. The statistical model chosen
should fit the nature of the data. In most cases, if censoring is negligible
and the follow-up period is clearly defined, logistic regression
is used; if censoring is significant or time to event is important,
then a survival time approach using a Cox proportional hazards
model is preferred. Other more complex model approaches exist
but are beyond the scope of this chapter [18, 19].

To ensure stability of the model coefficients in logistic and Cox
regression, an event frequency of at least 10 events per degree of
freedom in the model is advised [13]. For example, in a cohort of
1,000 patients where 100 outcomes have been observed, the prediction
model should include at most 10 variables. Ratios of less
than 10 events per variable can result in overfitting of the data,
leading to poor generalizability in other patient cohorts. All these
general aspects of study and model specification should be described
in the methods to allow assessment of internal validity.

2.2 Metrics of Predictive Performance

In addition to the general methodological issues enumerated
above, which are germane to all studies of prognosis, several metrics
specific to prediction model performance have been developed
to further assess internal validity.

2.2.1 Discrimination

Discrimination measures the ability of a model to accurately assign
a higher probability to patients who have the event of interest,
versus those who do not. The most commonly used metric of discrimination
for logistic and Cox models is the concordance or C
statistic. The C statistic is defined as the proportion of times the
model correctly discriminates between a randomly selected pair of
individuals (one case and one control), and is mathematically
equivalent, for a binary outcome, to the area under the receiver
operating characteristic curve (AUROC). As with the AUROC,
a C statistic of 0.50 indicates that the model performs no better than
chance; a C statistic of 0.70-0.80 indicates good discrimination;
and a C statistic of greater than 0.80 is consistent with excellent
discriminatory ability [20]. 
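
The pairwise definition of the C statistic translates directly into code. The sketch below assumes a binary outcome and counts tied predictions as half-concordant, a common convention that is an assumption here. The predicted probabilities and outcomes are hypothetical.

```python
def c_statistic(predicted, outcomes):
    """Proportion of case-control pairs in which the case has the higher predicted risk.

    `predicted` holds model probabilities; `outcomes` holds 1 (event) or 0 (no event).
    Tied predictions count as half-concordant.
    """
    cases = [p for p, y in zip(predicted, outcomes) if y == 1]
    controls = [p for p, y in zip(predicted, outcomes) if y == 0]

    concordant = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                concordant += 1.0
            elif p_case == p_control:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical predictions for 3 patients with the event and 5 without.
predicted = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.4]
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]

c = c_statistic(predicted, outcomes)
print(round(c, 2))
```

Of the 15 possible case-control pairs in this example, 13 are concordant and one is tied. The double loop is quadratic in the sample size, which is acceptable for a sketch but slow for very large cohorts.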

A necessary step in developing prediction models is comparing 
the discrimination of two alternative models in order to decide which 
is better. This has traditionally been done by comparing the difference
in C statistics between the models [20]. However, one of the
limitations of using the C statistic for this purpose is that it exhibits
asymptotic behavior: as the model C statistic approaches 1, it becomes
increasingly difficult to show a meaningful difference in C statistics despite
real improvements in model prediction. An alternative and more
sensitive measure of improvement is the integrated discrimination
improvement index (IDI) [21, 22]. The IDI measures the difference
in discrimination slopes between the two models (the discrimination
slope being the mean predicted probability for those with the outcome
minus that for those without), and describes this on an absolute and
relative scale. As such, the IDI
can be an effective method for comparing discrimination between
two models where differences in C statistic may be negligible. 
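
The discrimination slope and the IDI described above can be written out directly. The sketch below assumes a binary outcome and two lists of predicted probabilities for the same patients; the function names and the toy numbers are illustrative assumptions, not taken from any published analysis.

```python
from statistics import mean


def discrimination_slope(predicted, had_event):
    """Mean predicted probability in events minus that in non-events."""
    events = [p for p, y in zip(predicted, had_event) if y == 1]
    nonevents = [p for p, y in zip(predicted, had_event) if y == 0]
    return mean(events) - mean(nonevents)


def idi(old_predicted, new_predicted, had_event):
    """Return (absolute IDI, relative IDI) for a new model vs. an old model."""
    old_slope = discrimination_slope(old_predicted, had_event)
    new_slope = discrimination_slope(new_predicted, had_event)
    absolute = new_slope - old_slope
    # Relative IDI expresses the gain as a fraction of the old slope.
    relative = absolute / old_slope if old_slope != 0 else float("nan")
    return absolute, relative


outcome = [1, 1, 0, 0]
old = [0.6, 0.4, 0.3, 0.3]   # old slope: 0.5 - 0.3 = 0.2
new = [0.8, 0.6, 0.2, 0.2]   # new slope: 0.7 - 0.2 = 0.5
print(idi(old, new, outcome))
```

In this toy example the new model widens the gap between events and non-events from about 0.2 to about 0.5, so the absolute IDI is about 0.3.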

2.2.2 Calibration 

Model calibration refers to how well the model predictions agree 
with the actual data. For logistic regression models, the 
Hosmer-Lemeshow chi-square statistic is commonly employed for assessing 
calibration. In this test, patients are ranked into deciles of predicted 
probability, and the mean probability in each decile is then compared 
with the actual frequency of outcome among patients in that 
decile. A chi-square test is used to assess whether a significant 
discrepancy exists between predicted and actual probabilities; a 
significant chi-square statistic is interpreted as evidence that the model 
calibration is poor [15]. Alternative measures of calibration, such as 
the Brier score, which do not depend on statistically significant 
differences in each decile, have also been studied but require 
further testing in the biomedical literature. 

Calibration is an important metric for a clinical prediction 
model, because poorly calibrated models exhibit marked under- or 
overprediction of risk, which is problematic when used for clinical 
decision making. 
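
The decile comparison described above can be sketched as follows. This is a simplified illustration for a binary outcome: patients are sorted by predicted probability, split into equal-sized groups, and the squared difference between observed and expected events is accumulated. The resulting statistic is conventionally compared with a chi-square distribution with (groups - 2) degrees of freedom, a step omitted here to keep the sketch free of external libraries. The function names are assumptions for this example, and a Brier score is included for comparison.

```python
def hosmer_lemeshow_statistic(predicted, had_event, groups=10):
    """Hosmer-Lemeshow chi-square statistic over equal-sized risk groups."""
    pairs = sorted(zip(predicted, had_event))
    n = len(pairs)
    if n < groups:
        raise ValueError("need at least one patient per group")
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        mean_p = expected / n_g
        variance = n_g * mean_p * (1.0 - mean_p)
        if variance == 0:
            raise ValueError("group with mean predicted risk of 0 or 1")
        statistic += (observed - expected) ** 2 / variance
    return statistic


def brier_score(predicted, had_event):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(predicted, had_event)) / len(predicted)


# Toy data, two risk groups: predictions match observed frequencies exactly.
risks = [0.5] * 4 + [0.75] * 4
events = [1, 1, 0, 0, 1, 1, 1, 0]
print(hosmer_lemeshow_statistic(risks, events, groups=2))  # 0.0
```

A statistic near zero indicates close agreement between predicted and observed event counts; larger values indicate poorer calibration.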

2.2.3 Reclassification 

In clinical medicine, treatments and tests are often prescribed 
based on the predicted risk category of having an event. When a 
new prediction model is developed, it is important to consider 
whether it classifies patients into more appropriate risk strata than 
the old model. The new model may assign a given patient to the 
same risk category, a lower risk category, or a higher risk category 
relative to the old model. If the patient has an event, the new 
model can be considered successful if it assigns that patient to a 
higher risk stratum, but unsuccessful if it assigns a lower risk. 
Similarly, for patients who do not have an event, the new model is 
successful if it reclassifies them to a lower risk stratum, and 
unsuccessful if it reclassifies them to a higher risk stratum. This net 
success or failure is summarized by the Net Reclassification Index 
(NRI). Positive values for NRI indicate correct reclassification and 
negative values indicate incorrect reclassification. The NRI should 
be calculated using clinically accepted risk categories, wherever 
possible [21, 22]. 

2.2.4 External Validity/Validation 

External validity addresses the question of whether the results of 
the study sample are generalizable to the broader population in 
which one would like to apply the results. External validity for 
prediction models can never be assumed, as it is well known that 
models generated from one set of data (derivation set) usually perform 
less well in other cohorts. This may be due to a variety of reasons, 
including true differences in disease or biology between the 
cohorts, a non-representative derivation or validation cohort, 
overfitting of the initial models, and/or spurious associations between 
predictors and outcomes in the derivation cohort. Some of these 
sources of error can be minimized in the derivation phase, by 
carefully choosing a cohort of patients representative of the clinical 
condition, and by choosing candidate predictor variables on face 
value rather than statistical association. Nevertheless, it is critical to 
perform a validation step by demonstrating that model calibration, 
discrimination, and reclassification remain useful in at least one 
other, and ideally many other, independent cohorts separated by 
time, geography, or source population. Although typically all the 
metrics of model performance will be lower in the validation cohort 
as compared with the original derivation dataset, provided that the 
drop in performance is not clinically significant, the model can be 
considered generalizable [13]. 

2.2.5 Knowledge Translation 

The best predictive model is unusable at the bedside unless it can 
be rapidly and efficiently applied. Most predictive models are 
derived from complex logistic or proportional hazards models and 
include multiple coefficients and exponents that cannot be easily 
computed using simple arithmetic. This means that the model, 
once validated, needs to be translated into a clinically useful 
bedside tool. Until recently, this meant transforming the model into a 
simpler scoring system, with some inevitable loss of precision, 
discrimination, and calibration. However, with the advent of 
smartphones and rapid access to web-based calculators or smartphone 
apps, the predictive equation can be applied exactly, via a simplified 
user interface, and without complex calculations. We believe that 
the integration of prediction models into electronic user interfaces, 
and particularly into electronic health records and laboratory 
information systems, can greatly enhance knowledge translation 
[23-26]. 
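
To make concrete why such equations are awkward to evaluate by hand yet trivial for a calculator, the sketch below shows one common form in which a proportional hazards model is turned into an absolute risk: one minus the baseline survival raised to the exponential of the centered linear predictor. Every coefficient, mean, and baseline survival value here is a hypothetical placeholder chosen for illustration; none is taken from a published equation.

```python
import math


def predicted_risk(baseline_survival, coefficients, values, means):
    """Absolute risk at a fixed horizon from a Cox-type equation.

    risk = 1 - S0 ** exp(sum(beta_k * (x_k - mean_k)))
    """
    linear_predictor = sum(
        coefficients[name] * (values[name] - means[name]) for name in coefficients
    )
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


# Hypothetical placeholder numbers, for illustration only.
coefficients = {"age_per_10y": -0.2, "egfr_per_5": -0.5}
means = {"age_per_10y": 7.0, "egfr_per_5": 7.2}
patient = {"age_per_10y": 6.5, "egfr_per_5": 5.0}

risk = predicted_risk(0.95, coefficients, patient, means)
print(f"predicted risk: {risk:.1%}")
```

A patient whose values equal the cohort means has a linear predictor of zero, so the predicted risk reduces to one minus the baseline survival.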
3 Practical Application: Development of the Kidney Failure Risk Equation 

In the following section, we describe our approach to developing a 
risk prediction equation for kidney failure applicable to patients 
with stage 3-5 chronic kidney disease [17]. 

3.1 Derivation Cohort Selection 

The development cohort was derived from a clinical database, the 
nephrology clinic electronic health record (EHR) at Sunnybrook 
Hospital, a part of the University of Toronto Health Network. This 
database was a prospective registry of patients seen by the 
nephrology group at Sunnybrook Hospital and included reliable 
information on predictor variables and outcomes of interest. Patients with 
CKD stages 3-5 (estimated GFR, <60 mL/min/1.73 m²) at the 
time of initial nephrology referral were included and were followed 
up between April 1, 2001, and December 31, 2008. The outcome 
of interest, kidney failure requiring dialysis or transplantation, was 
ascertained by reviewing clinic records and by matching patient 
IDs with the Toronto Regional Dialysis Registry, a comprehensive 
registry of all patients receiving dialysis in the Toronto area. 

3.2 Selection of Variables 

We selected candidate predictor variables on the basis of face validity. 
The pool of variables explored included age and sex; blood 
pressure and weight; comorbid conditions, including diabetes, 
hypertension, and etiology of kidney disease; and laboratory variables 
from serum and urine collected at the initial nephrology visit. 
All predictor variables were obtained at baseline from the 
nephrology clinic EHR in the development data set. Models were 
developed using Cox proportional hazards regression methods and 
evaluated using C statistics and integrated discrimination 
improvement (IDI) for discrimination, calibration plots and Akaike 
Information Criterion for goodness of fit, and net reclassification 
improvement (NRI). 

3.3 Model Development 

We developed a sequential series of models and compared those with 
more variables (i.e., greater complexity) to simpler ones. We used 
a combination of clinical guidance and forward selection to choose 
variables. Variables not associated with kidney failure 
(P>0.10) on univariate Cox regression were excluded from further 
analyses. The 7 models constructed are shown in Table 1. 
Models 1 through 3 were developed using age and sex, estimated 
GFR, and albuminuria, successively, and compared with each other. 
Models 4 through 7 were developed by adding either clinical 
variables (diabetes and hypertension), physical examination variables 
(systolic and diastolic blood pressure and weight), laboratory 
variables of CKD severity (which were associated with the outcome in 
multivariate forward selection), or all of the above and compared 
with model 3. 

Table 1 

Hazard ratios, goodness of fit, and discrimination of prediction models in the derivation cohort a 

Variable                              Models 
                                      1        2        3        4        5        6        7 
Baseline GFR, per 5 mL/min/1.73 m²             0.54     0.57     0.58     0.60     0.61     0.64 
Age, per 10 year                      0.86     0.75     0.80     0.80     0.79     0.82     0.82 
Male sex                              1.03 b   1.46     1.26     1.27     1.34     1.16 b   1.26 
Log spot urine ACR c                                    1.60     1.61     1.55     1.42     1.37 
Diabetes                                                         0.86 b                     0.88 b 
Hypertension                                                     1.17 b                     0.89 b 
Systolic BP, per 10 mm Hg                                                 1.15              1.14 
Diastolic BP, per 10 mm Hg                                                1.10              1.15 
Body weight, per 10 kg                                                    0.91              0.91 
Serum albumin, per 0.5 g/dL                                                        0.84     0.83 
Serum phosphate, per 1.0 mg/dL                                                     1.27     1.34 
Serum bicarbonate, per 1.0 mEq/L                                                   0.92     0.93 
Serum calcium, per mg/dL                                                           0.81     0.82 
C statistic d                         0.56     0.89     0.91     0.91     0.92     0.92     0.92 
Akaike Information Criterion d        5,553    4,834    4,520    4,521    4,463    4,432    4,378 
P value                                        <0.001   <0.001   0.40     <0.001   <0.001   <0.001 

Reproduced with permission from Tangri et al. JAMA 2011; 305:1553-1559 (ref. 17) 

ACR albumin-to-creatinine ratio, BP blood pressure, GFR glomerular filtration rate 

a Data are presented as hazard ratios unless otherwise specified. Models 2, 3, and 6 columns indicate models based on 
laboratory data. P values are for comparison of C statistics between successive models, except for models 5, 6, and 7, 
which are compared with model 3 

b Hazard ratios with P>0.05; all other hazard ratios are significant (i.e., P<0.05) 

c Hazard ratio for ACR represents a 1.0 higher ACR on the natural log scale. For the average patient with 20 mg/g of 
albuminuria, this represents an increase to 55 mg/g 

d Null values for C statistic and Akaike Information Criterion are 0.50 and 5,569, respectively. Higher values for C statistic 
and lower values for Akaike Information Criterion indicate better models 

3.4 Validation Cohort Selection 

The validation cohort was derived from another prospective clinical 
database, the British Columbia CKD Registry (Patient Registration 
and Outcome Management Information System), which captures 
clinical and laboratory data on all patients referred to nephrologists 
in BC. Patients with CKD stages 3-5 at the time of initial 
nephrology referral between January 1, 2001, and December 31, 2009, 
were included. Outcomes such as dialysis, death, and transplantation 
are all captured in the database, which matches all kidney failure 
outcomes with provincial and national registries. Note that the use 
of a separate validation cohort in an entirely different region of 
Canada and with a different case-mix of patients and practice patterns 
is a strength and enhances the inference of external validity. 

3.5 Results of the Study 

The development and validation cohorts included 3,449 patients 
(386 with kidney failure [11 %]) and 4,942 patients (1,177 with 
kidney failure [24 %]), respectively. 

3.6 Prediction Model Performance in the Development Cohort 

The hazard ratios for the variables and statistics for discrimination 
and goodness of fit for successive models in the development cohort 
are shown in Table 1. Model 1, including age and sex only, 
performed poorly (C statistic, 0.561; 95 % confidence interval [CI], 
0.529-0.593). The C statistic improved with the inclusion of 
estimated GFR in model 2 (0.892; 95 % CI, 0.874-0.910; P<0.001) 
and albuminuria in model 3 (0.910; 95 % CI, 0.894-0.926; 
P<0.001), did not improve with the addition of diabetes and 
hypertension in model 4 (0.909; 95 % CI, 0.893-0.925; P=0.40), 
and did improve with the inclusion of blood pressure and body 
weight in model 5 (0.915; 95 % CI, 0.899-0.931; P<0.001) and 
laboratory values in model 6 (0.917; 95 % CI, 0.901-0.933; 
P<0.001). Despite a similar C statistic, the AIC was lower (i.e., 
better fit) for model 6 than for model 5 (4,432 vs. 4,463, respectively). 
The inclusion of all variables in model 7 improved both the C 
statistic and the AIC compared with model 3 (0.921 [95 % CI, 
0.905-0.937] vs. 0.910 [95 % CI, 0.894-0.926] and 4,378 vs. 
4,520, respectively). Given these results, models 1, 4, and 5 were 
excluded from further evaluation. Models 2, 3, 6, and 7 were then 
tested in the validation cohort. 

3.7 Prediction Model Performance in the Validation Cohort 

In both cohorts, the C statistic was higher for model 6 compared 
with models 2 and 3 in the entire population. In the validation 
cohort, no further improvement was observed with the additional 
non-laboratory variables (0.835; 95 % CI, 0.819-0.851 vs. 0.841; 
95 % CI, 0.825-0.857; P=0.90 for model 7 vs. model 6). At all 
times, both the C statistic and integrated discrimination 
improvement were greater for model 6 compared with models 2 and 3 
(P<0.001 for all comparisons). Model 6 was more accurate than 
model 3 (integrated discrimination improvement, 3.2 %; 95 % CI, 
2.4-4.2 %). Figure 1 shows observed vs. predicted probability of 
kidney failure at 3 years for models 2, 3, and 6 in the validation 
cohort. The mean absolute difference between the observed and 
predicted probabilities over quintiles of risk was lower with model 6 
compared with models 2 and 3 (1.9 % vs. 2.7 % and 3.1 %, 
respectively), and the Nam and D’Agostino χ² statistic also indicated 
improved fit with model 6 compared with models 2 and 3 (χ² statistic, 
19 vs. 37 and 32, respectively). 

[Figure: panels for Models 2, 3, and 6; horizontal axis, predicted risk quintile] 

The predicted and observed event probability estimates represent the mean predicted probability from the Cox proportional hazards regression model and the mean 
observed probability from the population (Kaplan-Meier estimate) divided into quintiles of predicted probability. Predicted risk categories for quintiles 1 through 5 
correspond with 0% to 4.3%, 4.4% to 8.1%, 8.2% to 12.9%, 13.0% to 24.5%, and 24.6% to 53.9%, respectively, for model 2; 0% to 1.6%, 1.7% to 5.3%, 5.4% 
to 11.0%, 11.1% to 23.1%, 23.2% to 61.7%, respectively, for model 3; and 0% to 1.4%, 1.4% to 4.8%, 4.9% to 10.7%, 10.8% to 24.0%, 24.1% to 61.6%, 
respectively, for model 6. Nam and D’Agostino χ² statistic is 37, 32, and 19 for models 2, 3, and 6, respectively. 

Fig. 1 Observed vs. predicted risk of kidney failure at 3 years for Models 2, 3, and 6 in the validation cohort. 
Reproduced with permission from Tangri et al. JAMA 2011; 305:1553-1559 (ref. 17) 


3.8 Net Reclassification 

As discussed in part 1, the NRI calculates the overall improvement 
in risk classification as a result of the prediction model. To perform 
this calculation, it is necessary to have some a priori definition of 
discrete risk levels for the outcome of interest. Ideally, these risk 
levels should reflect currently accepted schema that are known to 
alter clinical decisions (e.g., in cardiovascular disease, many 
therapeutic decisions are based on the Framingham risk category). As 
established risk levels for kidney failure do not presently exist, we 
were obliged to define risk strata according to a “clinical 
reasonableness” criterion. To enhance the face validity of these 
thresholds, we asked clinicians what specific levels of risk would likely 
influence their management decisions for patients with stage 3 and 
stage 4 disease, respectively. Based on this input, we defined 
stage-specific CKD risk categories as follows. 

1. For CKD stage 3, the risk of kidney failure over 5 years was 
classified into the following categories: low (<5 %), medium 
(5.0-14.9 %), and high (≥15.0 %). 

2. For CKD stage 4, the risk of kidney failure over 2 years was 
classified as: low (<10 %), medium (10.0-19.9 %), and high 
(≥20.0 %). 
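
The stage-specific categories listed above, and the net reclassification calculation they feed, can be sketched as follows. This is a simplified illustration that treats the outcome as binary (event or no event within the horizon); an analysis of censored follow-up data would need survival-based estimates instead. The function names and the four toy patients are assumptions for this example.

```python
from bisect import bisect_right

# Category cut points as proportions: low / medium / high.
STAGE3_CUTS = (0.05, 0.15)   # 5-year risk, CKD stage 3
STAGE4_CUTS = (0.10, 0.20)   # 2-year risk, CKD stage 4


def risk_category(risk, cuts):
    """Return 0 (low), 1 (medium), or 2 (high) for a predicted risk."""
    return bisect_right(cuts, risk)


def net_reclassification(old_risk, new_risk, had_event, cuts):
    """Return (net among events, net among non-events, NRI)."""
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, event in zip(old_risk, new_risk, had_event):
        move = risk_category(new, cuts) - risk_category(old, cuts)
        if event:
            n_e += 1
            up_e += move > 0      # correct: event moved to higher stratum
            down_e += move < 0    # incorrect: event moved to lower stratum
        else:
            n_n += 1
            down_n += move < 0    # correct: non-event moved lower
            up_n += move > 0      # incorrect: non-event moved higher
    events_net = (up_e - down_e) / n_e
    nonevents_net = (down_n - up_n) / n_n
    return events_net, nonevents_net, events_net + nonevents_net


old = [0.04, 0.20, 0.10, 0.03]
new = [0.16, 0.10, 0.04, 0.03]
event = [1, 1, 0, 0]
print(net_reclassification(old, new, event, STAGE3_CUTS))  # (0.0, 0.5, 0.5)
```

In the toy data, one event moves up and one moves down (net 0 among events), while one of two non-events moves down (net 0.5 among non-events), giving an NRI of 0.5.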

Once the risk strata were defined, we then calculated NRI 
within each CKD stage for Model 3 vs. Model 2, Model 6 vs. 
Model 2, and Model 6 vs. Model 3 (Table 2). For patients with 
CKD Stage 3 and Stage 4, Model 3 consistently outperformed 
Model 2 with clinically meaningful improvements in 
reclassification. Similarly, Model 6 also outperformed Models 2 and 3, albeit 
with a smaller magnitude in reclassification improvement when 
compared to Model 3. Together these findings suggested that 
inclusion of albuminuria (Model 3) is critical for prognostication 
for kidney failure events, and the additional serum laboratory 
variables (Model 6) provide an incremental benefit in reclassification. 

Table 2 

Net reclassification improvement of the models in the validation cohort 

                 No. (%) of participants 
Models           NRI events     Non-NRI events   NRI and non-NRI events, no. (%) [95 % CI] 

CKD Stage 3 a    n=248          n=2,159 
3 vs. 2          76 (30.6)      296 (13.7)       372 (44.4) [36.5-52.2] 
6 vs. 2          91 (36.7)      296 (13.7)       387 (50.4) [42.7-58.1] 
6 vs. 3          21 (8.5)       -11 (-0.5)       10 (8.0) [2.1-13.9] 

CKD Stage 4 a    n=400          n=1,695 
3 vs. 2          10 (2.5)       374 (22.1)       384 (24.6) [17.7-31.4] 
6 vs. 2          5 (1.3)        432 (25.5)       437 (26.7) [20.1-33.3] 
6 vs. 3          -2 (-0.5)      78 (4.6)         76 (4.1) [-0.5 to 8.8] 

Reproduced with permission from Tangri et al. JAMA 2011; 305:1553-1559 (ref. 17) 
CI confidence interval, CKD chronic kidney disease, NRI net reclassification 
improvement 

a Risk categories for CKD stage 3 are 0-4.9 %, 5.0-14.9 %, and 15.0 % or more over 
5 years, and for CKD stage 4 are 0-9.9 %, 10.0-19.9 %, and 20.0 % or more over 2 years 

3.9 Knowledge Translation 

Prediction equations are not useful unless they can be readily 
applied at the bedside to aid clinical decisions. To aid knowledge 
translation, we provided an online appendix containing the written 
risk equation, and an online Excel spreadsheet to calculate the 
5-year risk of kidney failure. Furthermore, we partnered with a 
mobile software developer to include the KFRE in medical 
calculator applications (QxMD) for use with iOS, Android, and 
Blackberry-based devices. This app was downloaded and accessed >60,000 
times in the first year of publication. 

3.10 Subsequent Validation Steps 

Our original analysis comprised a single external validation step, 
which is really only a minimum requirement for establishing external 
validity. It is highly desirable to measure the performance of the 
prediction model in additional populations, as these analyses will 
provide insight into how reliably the model performs in populations 
that differ in terms of geography, case mix, and clinical practice. 
For this reason, our group proceeded to examine the performance 
of the Kidney Failure Risk Equation in multiple geographically, 
ethnically, and etiologically diverse CKD cohorts. 

In addition to validation studies conducted by other independent 
investigators [27, 28], we collaborated with investigators from 
the CKD Prognosis Consortium and performed a validation study 
of the KFRE (i.e., Models 3 and 6 as described above) in 23 cohorts, 
spanning 10 countries and 4 continents [29]. This validation study 
included 562,000 individuals who had 17,000 kidney failure events 
over a median follow-up of 7 years. Across all of our validation 
cohorts, the original KFRE performed extremely well and achieved 
discrimination statistics that exceeded the original validation (pooled 
C statistic >0.85 for Models 3 and 6 at 5 years). Calibration was also 
excellent at 2 and 5 years in most cohorts, suggesting that the KFRE 
could be used globally to facilitate clinical decision making based on 
absolute risk thresholds at these intervals. 


4 Summary 

Risk prediction equations are of increasing importance in clinical 
medicine, particularly in the management of chronic diseases. Risk 
prediction equations can help patients to better understand their 
prognosis, health practitioners to more accurately prescribe 
interventions, and health services organizations to better target 
populations at risk. 


Randomized Controlled Trials 1: Design 

Bryan M. Curtis, Brendan J. Barrett, and Patrick S. Parfrey 

Abstract 

Today’s clinical practice relies on the application of well-designed clinical research, the gold standard test 
of an intervention being the randomized controlled trial. Principles of the randomized controlled trial include 
emphasis on the principal research question, randomization, and blinding; definitions of outcome measures, of 
inclusion and exclusion criteria, and of comorbid and confounding factors; enrollment of an adequate sample 
size; planning of data management and analysis; and prevention of challenges to trial integrity such as drop-out, 
drop-in, and bias. The application of pretrial planning is stressed to ensure the proper application of 
epidemiological principles resulting in clinical studies that are feasible and generalizable. In addition, 
funding strategies and trial team composition are discussed. 

Key words Clinical trial, Randomization, Blinding, Sample size estimate 


1 Introduction 


The randomized controlled trial (RCT) is the gold standard of 
clinical research when determining the efficacy of an intervention 
[1]. Similar to experiments utilizing the scientific control method, 
it attempts to test a hypothesis about the effect of one variable on 
another, while keeping all other variables constant. As most 
epidemiological hypotheses usually relate to populations which are 
frequently diverse, the role of randomization is to obtain suitably 
comparable samples for evaluation. Furthermore, pretrial sample 
size estimation endeavors to ensure that the study will have the 
ability to detect clinical differences at a statistically significant level. 
While some clinical research questions may be inherently harder to 
answer than others (less amenable to the scientific control method), 
asking the right question coupled with the application of a 
well-designed trial can yield valuable information for practicing 
physicians. In addition to the above, the success and applicability of a 
randomized controlled trial depends on many aspects of research 
design that will be discussed. Indeed, a poorly designed 
randomized trial will not generate useful scientific information and 
a well-designed and executed prospective observational study may 
be more valuable. 

Although the randomized controlled trial is the gold standard 
test of efficacy in the evidence-based medicine era, there is 
emerging concern that the interest of science and society should not 
prevail over the welfare of individual patients [2, 3]. Ethical Review 
Boards or Human Investigation Committees composed of medical 
professionals and nonmedical members attempt to ensure the 
safety and welfare of the participants in clinical trials. Approval 
from these bodies is mandatory before undertaking any research 
involving human subjects. Participants must also enter into the 
research having freely given informed consent [4, 5]. 

Other concerns in clinical trials include generalizability, 
limitations in recognizing small treatment effects, the inability to conduct 
trials of sufficient duration to mimic treatment of chronic disorders 
[2], increasing costs, and whether efficacious interventions can be 
applied effectively in the community. 


2 Asking the Question 

2.1 Identifying the Problem 

Clinical research questions arise from an identified problem in 
healthcare that requires study to provide evidence for change in 
clinical practice: specific questions are asked and trials are 
subsequently designed to obtain answers. Epidemiology provides the 
scientific foundation for the RCT by identifying risk factors and 
their distribution in the general population, establishing their role 
in predicting poor outcomes, as well as quantifying the potential 
value of treating and preventing the risk in the general population 
[6]. Furthermore, the RCT also represents the ultimate application 
of translating basic science research into clinical utility, a process 
commonly referred to as “bench to bedside.” The application of 
both the epidemiological and basic science research underpins the 
RCT, the rationale for which is to undertake a safe human 
experiment, and subsequently apply the premise that observation and 
interventions in groups can be directly applied to treatment and 
prevention of disease in individuals [7]. 

Observational studies remain necessary to plan RCTs, to 
complement the observations of a randomized controlled trial, to 
generate hypotheses to be later tested with RCTs, or in some cases 
to provide answers that cannot be obtained by means of a 
randomized controlled trial [2]. For example, an RCT would not be 
appropriate to assess smoking cessation and its impact on preventing 
lung cancer. RCTs may be impossible in patients with rare diseases 
or may be logistically difficult when the primary clinical outcome 
occurs infrequently in large groups. Similarly, RCTs may not be 
appropriate when dealing with potentially rare but significant 
adverse effects. 

2.2 The Principal Research Question 


In order to perform a clinical trial, the research question must 
satisfy a number of requirements: it must be a clinically important 
one from the perspective of patients, professionals, and society in 
general; there must be equipoise, meaning uncertainty about the 
answer to this question; the answer must be generalizable and 
applicable to a wide enough spectrum of medical practice. A good 
question will attract funding and help identify potential 
collaborators who are convinced the question is important. As the RCT is 
designed specifically to test a hypothesis, the research question 
must be amenable to the application of an experimental design. It 
must be logistically feasible as one attempts to keep all other 
conditions the same while manipulating an aspect of care in the 
experimental group. As the outcome is measured after a predetermined 
period, this too must occur within a reasonable time. For example, 
it would be very difficult to evaluate the effect of an intervention 
on an outcome if it is very rare or will not occur for 20-30 years. 

It is important to know whether the research question has 
already been answered and how strong the evidence is to support 
the answer. Extensive literature reviews must be performed using 
appropriate resources such as Medline or the Cochrane Database 
to identify related prior work. Demonstration of equipoise is 
important not only in allocation of scarce research funding but 
particularly for ethical reasons. Investigators must not deny patients 
known effective treatment for the sake of new knowledge. For 
example, it would now be unethical to knowingly treat a 
hypertensive patient with “placebo only” versus a new drug for prolonged 
periods to investigate potential benefits. It may, however, be 
necessary to repeat prior work for confirmation of results or to solidify 
conclusions through improvements in trial design. Of course, 
debate exists as to what constitutes good evidence or what determines 
the strength of evidence. Sometimes a thorough meta-analysis can 
replace the need for a large RCT. These questions cannot be 
answered in general and must be considered carefully for each case. 

Once investigators are satisfied that their research question is 
clinically important, they must outline a specific hypothesis to be 
tested, referred to as the alternate hypothesis. The null hypothesis, 
or the negation of the hypothesis being tested, is then determined. 
It is the null hypothesis that is tested using statistical methods. 
Statistical purists believe that one can only reject or fail to reject 
(versus accept) a null hypothesis. For example, if one wanted to 
assess the effect of using hydrochlorothiazide versus placebo for 
treating mild hypertension, the alternate hypothesis would be: 
hydrochlorothiazide is better, on average, than placebo in 
lowering blood pressure. The null hypothesis would then be: 
hydrochlorothiazide is not better, on average, than placebo in lowering 
blood pressure. 


3 Trial Design 

3.1 Randomization 

The goal of randomization is to ensure that participants in different 
study groups are comparable at baseline. This includes variables 
that are known, unmeasurable, and unknown. Thus, the only 
difference between the groups will be the intervention under 
investigation. In clinical research, there is a mix of demographic 
and other attributes about which data can be collected during the 
trial. If these factors are similar in the intervention and control 
group, we can assume that the objectives of randomization were 
achieved and that distribution of factors about which the 
investigators are unaware will not systematically affect results. The 
probability of a participant’s a priori enrollment must be independent of 
group assignment (i.e., intervention or conventional therapy), 
ensuring no selection bias. The process is termed “allocation 
concealment” and prevents participants being enrolled in a trial on the 
condition they only receive a prespecified intervention [8, 9]. To 
maximize validity of this process in multicentre trials, the 
randomization and assignment should occur at a central location. This 
ensures consistency and decreases selection bias. Other options 
include the use of sequentially numbered, opaque, sealed 
envelopes or containers, the use of which can be audited. 

Simple randomization as described above can be augmented 
by various means. One such technique is called blocking. This 
involves the random allocation occurring in blocks in order to keep 
the sizes of intervention arms similar. For example, if the total 
sample size is one hundred, participants may be “blocked” into 
subsets that guarantee an assignment balance after each block is 
enrolled (e.g., if using blocks of 6, there would be 30 participants 
in each arm after 60 are enrolled). Another technique is “stratified 
randomization” and is used when investigators are concerned that 
one or more baseline factors are extremely important in 
determining outcome and want to ensure equal representation of this factor 
in both treatment arms, e.g., diabetics in RCTs of cardiac or 
chronic kidney disease. When participants are screened they are 
stratified by whether the factor, like diabetes, is present or not 
before they are randomized within these strata. “Cluster 
randomization” involves randomizing entire groups of participants en 
bloc, for example by hospital or location [10]. This method is 
particularly useful in the evaluation of nontherapeutic 
interventions such as education, quality improvement, or community-based 
interventions [11]. 
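
Blocking and stratification as described above can be sketched in code. The example below is a minimal illustration, not a production randomization service: it builds a permuted-block allocation sequence in which every completed block contains each arm equally often, and produces an independent sequence for each stratum. The function names, arm labels, and default block size are assumptions for this example, and a real trial would keep the generated list concealed from enrolling staff.

```python
import random


def permuted_block_sequence(n_participants, block_size=4,
                            arms=("intervention", "control"), seed=None):
    """Allocation list that is balanced across arms after every full block."""
    if block_size % len(arms) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)            # random order within the block
        sequence.extend(block)
    return sequence[:n_participants]


def stratified_sequences(stratum_sizes, block_size=4, seed=None):
    """One independent permuted-block sequence per stratum (e.g., diabetes yes/no)."""
    rng = random.Random(seed)
    return {
        stratum: permuted_block_sequence(size, block_size, seed=rng.random())
        for stratum, size in stratum_sizes.items()
    }


allocation = permuted_block_sequence(60, block_size=6, seed=2024)
print(allocation.count("intervention"), allocation.count("control"))  # 30 30
```

With blocks of 6 and two arms, any multiple of 6 enrolled participants is split exactly evenly, which is the balance property the text describes.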

The principles of randomization discussed above are for two 
treatment arms with 1:1 representation in each arm. These 
principles are also applicable to other trial designs where there are three 
or more treatment arms such as placebo vs. drug A vs. drug B. 
Additionally, investigators may wish to have more participants in 
one treatment arm versus the other. For example, 2:1 
randomization has two-thirds of the participants randomized to one group 
and one-third to the other. This technique may be used to ensure 
adequate participant numbers in a treatment arm to evaluate side 
effects of a new medication. 

Fig. 1 Possible outcomes in a 2×2 factorial design. Each participant has an 
equal probability of receiving one of the four intervention combinations 

A factorial design is one whereby two or more independent 
interventions are tested simultaneously during the same trial 
(Fig. 1). The Heart Outcomes Prevention Evaluation (HOPE) 
trial is one example of factorial design in which patients were 
randomized to receive either Ramipril, Vitamin E, Ramipril and 
Vitamin E, or Placebo [12, 13]. Therefore, participants had an 
equal probability of receiving one of four interventions: (1) 
Ramipril and Placebo, (2) Placebo and Vitamin E, (3) Ramipril 
and Vitamin E or (4) Placebo and Placebo. The benefits of this 
design are reduced costs compared with testing Ramipril and 
Vitamin E in two different RCTs, and the ability to detect interactions, 
such as those between an angiotensin converting enzyme inhibitor 
(Ramipril) and an antioxidant (Vitamin E). It is possible that an 
interaction occurs between two interventions such that the effect 
of the combination is different from the additive effect of both 
interventions separately; that is, the effect size of one intervention may 
depend on the presence or quantity of the other intervention. 
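
The notion of an interaction in a 2 × 2 factorial design can be made concrete with a small sketch. The event rates below are hypothetical numbers chosen for illustration and are not results from HOPE; the function names and the additive (risk-difference) scale are likewise assumptions.

```python
import random
from collections import Counter

def factorial_assignment(rng):
    """Randomize each factor 1:1 and independently, giving four equally likely cells."""
    factor_a = rng.choice(["ramipril", "placebo"])
    factor_b = rng.choice(["vitamin E", "placebo"])
    return factor_a, factor_b

def interaction_contrast(rate):
    """Additive-scale interaction: effect of A when B is given minus effect of A when it is not.

    `rate` maps (a_given, b_given) flags to the event rate in that cell.
    """
    effect_a_without_b = rate[(1, 0)] - rate[(0, 0)]
    effect_a_with_b = rate[(1, 1)] - rate[(0, 1)]
    return effect_a_with_b - effect_a_without_b

rng = random.Random(1)
cell_counts = Counter(factorial_assignment(rng) for _ in range(4000))

# Hypothetical event rates: effects simply add, so the contrast is zero.
additive = {(0, 0): 0.20, (1, 0): 0.15, (0, 1): 0.18, (1, 1): 0.13}
# Hypothetical event rates: A works better when B is also given.
synergistic = {(0, 0): 0.20, (1, 0): 0.15, (0, 1): 0.18, (1, 1): 0.08}

no_interaction = interaction_contrast(additive)
with_interaction = interaction_contrast(synergistic)
```

A contrast near zero means the effect of one intervention does not depend on the other, which is the situation in which a factorial trial answers two questions for roughly the price of one.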

3.2 Adaptive Trials

Because of concerns that clinical trials are time consuming, expensive 
and may be prone to failure (failing to demonstrate a clinically 
important benefit when in fact one exists), newer techniques have 
been designed to adjust the course of a clinical trial as data accrue 
[14-16]. These are termed flexible or adaptive trials. They involve 
changing a newly recruited participant’s odds of allocation to 
different treatment arms based on outcomes already achieved as 
the trial progresses. Concerns include the ability to maintain equipoise, 
potential participants delaying enrollment hoping they are 
more likely to be assigned more effective interventions, and early 
data signals that may eventually prove false [14]. Another alternative 
is event-driven enrollment, where the number of participants 
enrolled (i.e., sample size) is determined by the actual event rate in 
the RCT, rather than the projected event rate from other studies. 

3.3 Crossover

The crossover design is a strategy that may be employed when the 
outcome is reversible. It involves participants being assigned to 
one intervention and subsequently getting the competing intervention 
at a later time, preferably in random order. The advantage 
is that participants can serve as their own control. It is particularly 
suited for short-term interventions and outcomes where time-related 
trends are not expected. The design is efficient in that treatments 
are compared within individuals, reducing the variation or 
noise due to subject differences. However, loss to follow-up can be 
particularly problematic when interpreting results. Further limitations 
include possible differential carryover (one of the treatments 
tends to have a longer effect once stopped); period effects (different 
response of disease to early versus later therapy); and a greater 
impact of missing data because they compromise within-subject 
comparison and therefore variance reduction [17]. 

3.4 Non-inferiority

Where proven effective interventions already exist, 
placebo-controlled trials may be unethical [18, 19]. 
Furthermore, as knowledge accumulates, the incremental benefit 
from interventions may be small, requiring large sample sizes to 
demonstrate a benefit. Non-inferiority trials (sometimes incorrectly 
called equivalence trials) are intended to show that the effect of a 
new treatment is not worse than that of an active control by more 
than a specified margin [20]. For example, the claim might be 
made that a new ACE inhibitor is non-inferior to Enalapril if the 
mean 24 h blood pressure with the new ACE inhibitor is no 
more than 3 mmHg higher than that with Enalapril. One major 
concern with this type of design is the loss of factors in trial design 
that ensure conservative interpretation of the outcomes. For example, 
loss to follow-up will tend to protect against type I error in 
placebo-controlled trials but will bias towards the new treatment in 
a non-inferiority trial. 

3.5 Blinding or Masking

Blinding is a technique utilized to decrease both participant and 
investigator bias. An “open trial” describes the situation where 
both the participants and investigators are aware of all treatment 
assignments and details. A “single-blind” trial is one where the 
researcher knows the treatment but the participant is unaware. 
Double-blinding entails both participants and investigators not 
knowing which treatment arm the participant is enrolled in. This 
removes bias on the investigators’ part, as it is possible that the 
participant may be treated differently by investigators depending on 
assignment, even subconsciously. Double-blinding is sometimes 
not possible or may be inappropriate, such as with trials of surgical 
intervention. If double-blinding is not possible, then it is especially 
important to have objective outcomes and to ensure that those involved 
in outcome measurement and adjudication are blind to the intervention 
received. 

3.6 Multicenter

Multicenter trials, although increasing logistical complexity, will 
allow for greater enrollment opportunities. They also have the 
advantage of diminished “center effect,” making results of the trial 
more applicable to a wider spectrum of patients. It helps to include 
academic and nonacademic centers where possible to further 
decrease case-mix bias, as the spectrum of patients and disease seen 
at various centers may be different. 

3.7 Planned Trial Allocations and Interventions

It is important to precisely outline what the intervention will be 
and how both the treatment and the control groups will be managed. 
The goal is to decrease subjectivity in interpreting and applying 
the trial protocol [21]. Other details of dosing, co-therapy, 
and attainment of therapeutic targets should be clarified as well as 
contingency plans for side effects. It is important that conventional 
therapy be applied equally and effectively in both intervention and 
control groups. This will diminish the chances of extraneous factors, 
other than the interventions being tested, influencing the primary 
outcome. Effective conventional therapy may also lead to a lower than 
expected event rate, a common occurrence in well-conducted RCTs. 

3.8 Inclusion and Exclusion Criteria

Investigators should strive to design trials representative of the 
relevant clinical practice. Thus, inclusion criteria are important to 
ensure the study question is answered in a population of subjects 
similar to that in which the results need to be applied. For logistic 
reasons participants must be residing in a location amenable to 
follow-up. People who are unable to consent should be excluded. 
Of course exceptions exist. These may include studies which 
involve pediatric, dementia, or intensive care patients where 
third-party consent would be more appropriate. Furthermore, 
investigators should try to ensure that patients unlikely to benefit 
from the treatment or prone to side effects from the intervention 
do not contaminate the study population. For example, in studies 
of chronic disease it would mean excluding participants with other 
concomitant diagnoses conferring poor prognosis, such as active 
cancer. Finally, some people are excluded for general safety or ethical 
considerations, such as pregnant participants. Clear definitions of 
inclusion and exclusion criteria are essential. 

3.9 Primary and Secondary Outcome Measures

The primary outcome is the most important measure as this provides 
the answer to the principal research question. Secondary outcome 
measures may be used to evaluate safety, such as mortality and 
comorbidities, or additional effects of the intervention. It is important 
that these measures are specified before the trial is underway 
and that there are not too many. This prevents post hoc analysis 
from diminishing the conclusions and interpretation. For example, 
once a trial is completed and data collected on many outcomes at 
various follow-up times, it would not be appropriate to analyze all 
the data points to see which outcome was significantly different. 
This is because the more the data are analyzed, the more likely it is 
that a statistically significant result will be found just by chance (a type 
I error). The same effect of multiple analyses also limits the number of 
prespecified secondary outcomes that can be tested reliably. Again, 
clear definitions of primary, secondary, and tertiary outcomes are essential. 

3.10 Measuring the Outcome Measures at Follow-Up

Investigators need a relevant, valid, precise, safe, practical, and 
inexpensive means of judging how a particular treatment affects 
the outcome of interest [22]. For example, in chronic kidney disease 
hard end points such as death or dialysis are preferred because 
of their uniform definition and objectivity. However, it should be 
remembered that kidney disease progresses at variable rates and the 
incidence of these advanced end points may be too low in early 
stage kidney disease. Using surrogate markers, studies may be conducted 
with smaller sample sizes and over a shorter period. The 
principal drawback of surrogate markers is their imperfect relationship 
to hard end points. Nonetheless, for chronic kidney disease, 
for example, it has been suggested to use “intermediate” end 
points such as both doubling of serum creatinine and reduction in 
proteinuria by at least 30 % from baseline [22]. Although less accurate, 
these measures are easy to assess, more practical than more 
cumbersome measures such as inulin clearance, and acceptable to 
some regulators, such as the Food and Drug Administration. 


4 Size and Duration of Trial 

4.1 Estimating Sample Size

Investigators need to know how many participants are required to 
enter in an RCT to have a good probability of rejecting the null 
hypothesis for a clinically meaningful effect of the intervention and 
to have confidence that the conclusion is true. Because trials test a 
hypothesis on a subset (sample) of the whole population, it is 
possible to have results that are skewed due to chance depending 
on the sample population. A type I error occurs when the results 
from the sample population permit investigators to reject a null 
hypothesis which in fact is true. Type II errors occur when investigators 
fail to reject the null hypothesis when in fact the null hypothesis 
is false. The risk of committing a type I error is denoted by 
alpha and the risk of committing a type II error is denoted by beta. 
It has become convention to accept a 5 % chance of committing a 
type I error. The probability of correctly failing to reject a true null 
hypothesis is sometimes called the confidence, equal to one minus alpha (Table 1). The 
probability of rejecting a false null hypothesis is determined by the 
power of the RCT. Power is equal to one minus beta, and RCTs 
are usually designed to have a power of 80-90 %. 

An estimate of the likely event rate in the control group, from 
prior studies or preliminary data, is required to calculate sample 
size. The clinically relevant effect size due to the intervention is 
chosen for investigation. When the primary outcome is a continuous 
variable the sample size estimation depends on the expected mean 
and variability (standard deviation) in the control group. A decision 
is made to use a one-tailed or two-tailed test when comparing 
the outcomes in the intervention and control groups. Usually a 
two-tailed test is used because it is possible that the intervention 
will cause harm as well as benefit. 

Table 1 
Type I and type II errors in a clinical trial 

                                 True state in the population
Trial decision                   True null hypothesis       False null hypothesis
Reject null hypothesis           Type I error               Correct decision
                                 Probability = alpha        Probability = 1 - beta (= power)
Do not reject null hypothesis    Correct decision           Type II error
                                 Probability = 1 - alpha    Probability = beta

Computational programs are available to calculate the sample 
size which take account of type I and type II error, the event rate 
in the control group, and the effect size to be studied. Sometimes 
overlooked, but very important, are loss to follow-up and missing 
or incomplete data when the trial ends, both of which tend to be 
underestimated. These must be anticipated beforehand and incorporated into 
sample size estimation. A practical “rule-of-thumb” is to allow for 
a minimum of 20 % loss. Finally, the sample size also depends on 
how the data are to be analyzed, whether by intention-to-treat or by 
treatment received. Usually the intention-to-treat analysis is the 
primary analysis. 
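
As an illustration of how alpha, power, the control event rate, the effect size, and expected loss combine, the sketch below uses the standard normal-approximation formula for comparing two proportions. The event rates of 20 % and 15 % are hypothetical, the function names are invented for this example, and dividing by one minus the expected loss is one common convention rather than the only way to allow for loss to follow-up; dedicated sample size software should be used for a real trial.

```python
import math
from statistics import NormalDist

def n_per_arm_two_proportions(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate participants per arm for a two-tailed comparison of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed type I error
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return (z_alpha + z_beta) ** 2 * variance / effect ** 2

def inflate_for_loss(n, expected_loss=0.20):
    """Enroll enough participants that n remain after the expected loss to follow-up."""
    return n / (1 - expected_loss)

raw_n = n_per_arm_two_proportions(0.20, 0.15)       # hypothetical event rates
n_analyzable = math.ceil(raw_n)
n_to_enroll = math.ceil(inflate_for_loss(raw_n))
n_at_90_power = math.ceil(n_per_arm_two_proportions(0.20, 0.15, power=0.90))
print(n_analyzable, n_to_enroll, n_at_90_power)
```

The required number rises quickly as the effect size shrinks (it scales with the inverse square of the difference in event rates) and as the desired power increases.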

There are several options available to limit sample size when 
designing a trial. These include recruitment of subjects at higher 
risk for an outcome event, thus increasing the event rate, but doing 
so affects the generalizability of trial results. Another option is to 
use composite outcomes. Individual components of composites 
may be uncommon, but together the rate of composite events may 
be high enough to limit the sample size required. Components of 
composite outcomes include events that share a likelihood of benefiting 
from the intervention under study. For example, a trial 
might seek to determine the effect of a lipid-lowering drug on 
future myocardial infarction, revascularization, or cardiovascular 
death. However, the impact of therapy on individual components 
of the composite may vary, and not all components of the composite 
are likely to be of equal clinical importance. 

4.2 Recruitment Rate

Again, experience from prior studies, or preliminary data, is required 
to estimate recruitment rate. It must be stressed that the number 
of participants willing or able to enroll will be lower than the number 
who are eligible. Factors such as a potential participant’s location, 
enrollment in other trials, unwillingness or inability to consent, 
simply being overlooked, and other physicians’ hesitancy to enter 
their patients in a particular trial all decrease the recruitment rate, often 
to less than 20 % of the eligible population. 

4.3 Duration of the Treatment Period

Along with time to recruit an adequate sample, the rate and timing 
of the primary outcome events may affect the length of the trial. 
Additionally, the treatment or intervention may not begin when 
participants are enrolled into the trial. Depending on the protocol 
there may be run-in periods where participants are monitored for 
further exclusion to ensure eligibility, or time may be needed for 
wash-out periods of medications. Similarly, the treatment may end 
before the trial finishes as time is necessary to follow the participants 
for long-term outcomes. For example, patients with idiopathic 
membranous nephropathy and nephrotic syndrome were randomly 
assigned to receive symptomatic therapy or a treatment with methylprednisolone 
and chlorambucil for 6 months and clinical outcomes 
were then determined up to 10 years later [23]. Finally, 
treatment periods may end when certain outcomes are met, such as 
transplant, but participants need follow-up for other end points, 
such as death. 

It helps to have a flow diagram (Fig. 2) outlining the streaming 
of participants through the processes of screening, consent, further 
assessment of eligibility, contact with the trial center, stratification, 
randomization, and follow-up. This will aid in keeping track of 
participants in the trial and accounting for them during interpretation 
and presentation of results. 

Fig. 2 Example of flow diagram for participants in multicenter randomized 
controlled trial 


5 Trial Data 

5.1 Data Collection and Management

The database to be analyzed will need to be designed ahead of time 
and corresponding data collection and collation methods must be 
accurate but easy-to-use. The data should be collected in a timely 
fashion. Data entry is becoming more sophisticated with current 
methods ranging from centralized entry by clerks using the data 
collection forms to Web-based entry at a remote collection or clinical 
site. Methods to error check or clean the data must be organized. 
The final data must be kept secure. 

5.2 Details of the Planned Analyses

Appropriate analysis and statistical methods to test the hypothesis 
depend on the question and must be chosen during the planning 
phase. Intention to treat analysis is a method where participants are 
analyzed according to their original group at randomization. This 
technique attempts to analyze the data in a real world fashion without 
adjusting for drop-out or drop-in effects. Per protocol analysis, 
or analysis by treatment received, attempts to analyze the groups 
according to what the actual treatments were. This way, for example, 
if a participant drops out early in the trial, they are excluded 
from the analysis. Otherwise, the analysis depends on the type of 
variable, whether repeated measures are used or if “time to event” 
outcomes are important. The trial design will also affect analysis 
depending on confounders, stratification, multiple groups or interactions 
in the case of factorial designs. 

5.3 Planned Subgroup Analyses

Subgroup analysis compares groups within the 
main cohort, for example, diabetics versus nondiabetics in chronic 
kidney disease. Although not as informative as the whole group 
analysis because sample sizes may be inadequate, important information 
may be obtained. It is better to decide upon limited subgroup 
analysis during trial design than after the data have been 
collected. The problem in interpreting subgroup analysis is the 
higher risk of obtaining apparently statistically significant results 
that have actually arisen due to chance. 

5.4 Frequency of Analyses

There may be reasons to analyze a trial while underway. Safety 
monitoring is probably the most frequent reason. Sometimes during 
interim analyses the outcomes may differ so markedly from what 
was expected (either much better or much worse) 
that the continuation of the trial is no longer necessary [24]. In 
general, statistical stopping rules are used in this circumstance to 





Bryan M. Curtis et al. 


preserve the originally chosen final alpha by requiring considerably 
smaller p-values to stop the trial [25]. Another reason for interim 
analysis is when investigators are uncertain of likely event rates in 
the control group. Interim analyses that do not address the 
primary study question do not affect trial type I error rate. 
Combined event accumulation, or defined intervals or dates, may 
dictate the precise timing of the interim analysis. An example of a 
trial halted by the data safety committee is the Besarab erythropoietin 
trial [26]. That prospective study examined normalizing 
hematocrit in patients with symptomatic cardiac disease who were 
undergoing hemodialysis by randomizing participants to receive 
epoetin to achieve hematocrit of 42 vs. 30 %. The study “was halted 
when differences in mortality between the groups were recognized 
as sufficient to make it very unlikely that continuation of the study 
would reveal a benefit for the normal-hematocrit group and the 
results were nearing the statistical boundary of a higher mortality 
rate in the normal-hematocrit group [26].” In addition, the intervention 
group had a highly significant increased rate of vascular 
access thrombosis and loss indicating that the intervention group 
was exposed to harm. 

5.5 Economic Issues

Economic issues are a reality when it comes to changing practice 
patterns and health care policy. It helps if an intervention can be 
shown to be cost-effective. The methods are reviewed in another 
chapter. In general, data on resource use are collected prospectively 
as part of the RCT. Of note, it may be easier to acquire funding 
from government or hospital sources for the RCT if the outcome 
is likely to be in their financial interest in the long term. 

5.6 Audit Trail

For quality control and scientific integrity, investigators must plan 
for ongoing record keeping and provide an audit trail. Checking 
for and correcting errors should be done concurrently. Similarly, 
investigators will need to retain records for future access if needed. 
The length of time for retention is determined by requirements of 
local or national regulatory agencies, Ethics Review Boards, or 
Sponsors. 


6 Challenges to Trial Integrity

One goal of the clinical trial is to estimate the treatment effect, and 
a well-executed trial should be a model for the real world setting. 
This section will focus on aspects of trials that will diminish the 
ability of the trial to detect the true treatment effect. 

6.1 Rate of Loss to Follow-Up

Similar to recruitment, where it is not always possible to enroll 
eligible participants, it is not always possible to keep participants in 
trials once enrolled. People may “drop-out” because they become 
uninterested or disenchanted, or they are lost because of relocation, 
adverse event or death. There are many strategies for preventing 
this, which depend on the reason participants are lost. Reminders, 
incentives or regular contact with participants help keep their 
interest but this must be done without being too intrusive. Similarly, 
the trial team must be accessible with efficient and effective measures 
in place to alleviate concerns participants may have. To deal 
with relocation or death, it is prudent to have permission from 
participants to get collateral information from Vital Statistics 
Bureaus or from relatives. 

Another form of “drop-out” has been historically termed noncompliance. 
In this situation participants are still being followed, 
yet they are not adhering to the trial intervention. This especially 
contaminates the active treatment group and may decrease the 
apparent effectiveness of an intervention. Strategies that have been 
utilized to reduce this include reinforcement, close monitoring 
and pill counting. 

Similarly, participants in a control arm may inadvertently, or 
otherwise, receive the experimental treatment. This is termed 
“drop-in.” These phenomena combined act like a crossover design 
where the investigators are either unaware of its occurrence or may 
be aware but unable to control it. There is no easy way to deal with 
this but most investigators will analyze the groups as they were 
originally assigned (i.e., intention to treat analysis). It is more 
important to identify areas where this may occur during trial design 
and try to prevent it. Keeping in mind that some loss is inevitable, 
the importance of considering these issues when estimating sample 
size during trial planning is again stressed. 
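
The way drop-out and drop-in dilute an intention to treat comparison can be shown with simple arithmetic. The sketch below assumes, for illustration only, that participants who stop the intervention experience the control event rate, that control participants who obtain the intervention experience the treated event rate, and that non-adherence is unrelated to prognosis. All event rates and proportions are hypothetical, and the inflation factor is a commonly used approximation rather than an exact result.

```python
def itt_event_rates(p_control, p_treated, drop_out, drop_in):
    """Expected event rates in each arm as randomized, given non-adherence."""
    treatment_arm = (1 - drop_out) * p_treated + drop_out * p_control
    control_arm = (1 - drop_in) * p_control + drop_in * p_treated
    return treatment_arm, control_arm

def sample_size_inflation(drop_out, drop_in):
    """Approximate factor by which sample size must grow to offset the dilution."""
    return 1.0 / (1.0 - drop_out - drop_in) ** 2

# Hypothetical trial: true event rates of 20 % (control) and 10 % (treated),
# 20 % drop-out from the treatment arm and 10 % drop-in from the control arm.
treatment_arm, control_arm = itt_event_rates(0.20, 0.10, drop_out=0.20, drop_in=0.10)
true_difference = 0.20 - 0.10
observed_difference = control_arm - treatment_arm
inflation = sample_size_inflation(0.20, 0.10)
print(round(observed_difference, 3), round(inflation, 2))
```

Under these assumptions the difference seen by intention to treat analysis is the true difference multiplied by one minus the combined drop-out and drop-in proportions, which is why even modest non-adherence can roughly double the sample size needed.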

Centers may affect trials in similar ways. They too may “drop-out” 
and this must be taken into consideration when planning. 
Prior and ongoing collaboration may decrease this occurrence; 
however, different center case mix may dictate that some centers 
will not be able to continue in the trial. For example, they may not 
have enough potential participants. Otherwise similar strategies to 
keep participants may be used to keep centers involved. 

6.2 Methods 
for Protecting Against 
Other Sources of Bias 


Standardization of training, methods, and protocol must be undertaken 
as rigor in these areas decreases variability [27]. Similarly the 
use of a single central laboratory in multicenter studies can lessen 
variability in biochemical tests. 


7 Funding 


The research question must be answerable in a reasonable period 
of time for practical reasons. Costs will be influenced by the sample 
size, time to recruit participants, time needed for outcomes to 
accrue, the intervention, the number of tests and interviews, 
managing and auditing the trial, and data management and analysis. 
A budget encompassing all these issues is essential. 

7.1 Costs

The majority of today’s clinical research is human resource intensive 
and funding is required for the expertise of research nurses and 
assistants. Similarly, money is needed to pay for research management, 
data entry and other staffing necessary for administration of 
the trial. Employee benefits such as pensions, sick leave, compassionate 
leave, maternity/paternity leave, jury duty, and vacation must be 
factored in to costs. Costs of items such as paper, fax, phone, travel, 
computers, and other consumables can be significant. 

7.2 Licensing

Licensing is required to establish new indications for drugs or 
techniques. This increases costs for trials as regulations for how 
these trials are conducted may be very strict, and for safety reasons, 
monitoring and quality control can be more intense. However, 
there is usually an increase in industry sponsorship to help alleviate 
the increased cost. 

7.3 Funding Sources

Procuring funding is sometimes the major hurdle in clinical research. 
Fortunately, it is easier to find money to do topical research. For 
more expensive trials it may be necessary to obtain shared or leveraged 
funding from more than one of the following limited sources. 
Public funding from foundations or institutes such as The National 
Institutes of Health in the USA is available. These have the benefit 
of the applications being peer reviewed. Government and Hospital 
Agencies have funds available for research but they can be tied to 
quality improvement or research aimed at cost reduction. They may 
also contribute “in-kind” other than through direct funding by 
providing clinical space and nursing staff. 

It is becoming increasingly difficult to do major clinical research 
without the aid of the private sector. Issues of data ownership and 
publication rights must be addressed during the planning phase, as 
should the relative responsibilities of the applicants and the sponsor. 
It is usually preferable for the investigator to approach the 
private funding source with the major planning of the trial already 
completed. 


8 Details of the Trial Team 

8.1 Steering Committee

Larger trials may need a steering committee for trial management. 
The main role is to refine design details, spearhead efforts to secure 
funding, and work out logistical problems. This may also include 
protocol organization and interim report writing. The committee 
may include experts in trial design, economic analysis, and data 
management, along with experts in the various conditions being 
studied. 

8.2 Trial Manager

Clinical trials require efficient trial management. This ensures finite 
human and financial resources are utilized efficiently, and in a 
timely manner. A dedicated trial manager can help improve the 
success of trial completion [28]. 

8.3 End-Point Adjudication Committee

An end-point adjudication committee may be utilized to judge 
clinical end points and is required when there are subjective elements 
of decision-making or when decisions are complex or error-prone. 
Criteria for the clinical outcomes have to be prespecified 
and applicable outside of a trial setting. The committee should 
include physicians and experts in the respective disease being studied 
and from specialties related to comorbid end points being 
assessed. They should not be otherwise involved in the trial and 
should be blinded with respect to the participants’ intervention. 

8.4 Data Safety and Monitoring Committee

A data safety and monitoring committee is needed for trials if there 
is a possibility that the trial could be stopped early [29]. This may 
be based on interim analysis when sufficient data may exist to provide 
an answer to the principal research question or when unexpected 
safety concerns emerge. This is especially relevant when outcomes 
of the trial are clinically important. Another reason would be 
because of the cost of continuing the trial when interim analysis 
suggests that continuation of the trial would be futile [30]. This 
committee, similar to the end-point adjudication committee, 
should be made up of experts not otherwise involved in the trial 
and should include specialists from the relevant disciplines and a 
statistician. The committee may not be provided with group comparisons 
in some meetings but will meet to consider other issues 
related to trial quality. They should consider external data that may 
arise during the trial in considering termination. For example, if an 
ongoing trial is testing Drug A versus Placebo and another trial is 
published showing conclusive evidence that Drug A, or withholding 
Drug A, is harmful then the ongoing trial may be terminated 
(even before interim analysis). This occurred in The Reduction of 
Endpoints in NIDDM with the Angiotensin II Antagonist Losartan 
(RENAAL) trial [31] following publication of the HOPE study 
[12] after which it was judged unethical to withhold therapy aimed 
at blockade of the renin-angiotensin system from patients on conventional 
treatment. 

8.5 Participating Centers

If multiple centers are required then each center needs a responsible 
investigator to coordinate the trial at their center. This may 
include applying to local ethics boards, coordinating local staff, 
screening, consenting, enrolling, and following participants. A letter 
of intent from each center’s investigator is usually required to 
secure funding. 

9 Reporting 


Accurate reporting of RCTs is necessary for accurate critical 
appraisal of the validity and applicability of the trial results. All trials 
should initially be registered in a clinical trials registry, such as 
clinicaltrials.gov, allowing for transparent public access to information. 
Registration is required by law in certain countries and can be 
a requirement for publication in many journals. Trial registries also 
help to address publication bias, especially in the case of negative 
or subsequently unpublished trials [32]. 

The CONSORT (Consolidated Standards of Reporting Trials) 
Statement, revised in 2010 [33], contains a 25-item checklist and 
flow diagram. Use of this guidance was associated with improved 
quality of reporting of RCTs [34]. Twenty-five percent of RCTs 
involve non-pharmacological treatments, which require more 
extensive reporting, particularly in relation to the experimental 
treatment, comparator, care processes, centers and blinding. 

Chapter 10 


Randomized Controlled Trials 2: Analysis 

Robert N. Foley 

Abstract 

When analyzing the results of a trial the primary outcome variable must be kept in clear focus. In the analysis 
plan consideration must be given to comparing the characteristics of the subjects, taking account of differences 
in these characteristics, intention to treat analysis, interim analyses and stopping rules, mortality comparisons, 
composite outcomes, research design including run-in periods, factorial, stratified, and crossover 
designs, number needed to treat, power issues, multivariate modeling, and hypothesis-generating analyses. 

Key words Randomized controlled trials, Analysis, Intention to treat, Research design, Stopping 
rules, Multivariate modeling 


1 Introduction 


The objective of this chapter is not to formally explore the mathe¬ 
matics of statistical testing or the pluses and minuses of different 
statistical software packages. The intent is to focus on selected ana¬ 
lytical considerations that may help one to decide whether the evi¬ 
dence in a given randomized trial is valid and useful in real-world 
clinical practice. For the most part, the chapter discusses issues 
related to trials with outcomes that are clinically meaningful and that 
are assumed to lead to permanent changes in health status, where a 
definitive result would be expected to change clinical practice. 


2 What Is the Primary Study Question? 

Therapeutic uncertainty is the basis of randomized trials and most 
trials are designed to answer a single question. When analyzing the 
results of a trial, the primary hypothesis and the primary outcome 
variable should be kept in clear focus. The nature of the question 
should be unambiguous and should immediately suggest major 
design elements and appropriate methods for statistical analysis. 
For example, in the primary report of the Diabetes Control and 
Complications Trial (DCCT) [1], the abstract states the following: 
“Long-term microvascular and neurologic complications cause 
major morbidity and mortality in patients with insulin-dependent 
diabetes mellitus. We examined whether intensive treatment with 
the goal of maintaining blood glucose concentrations close to the 
normal range could decrease the frequency and severity of these 
complications.” If one takes the purist approach that there can 
only be a single primary outcome, this description is ambiguous; 
for there to be a single primary outcome, the description suggests 
the possibility that a composite outcome was employed in which 
the outcome was the first occurrence of any microvascular or neu¬ 
rological complication; equally well, the total number of such 
complications occurring in a defined period of time is also compat¬ 
ible with the terminology used. Needless to say, the analysis and 
reporting of these two outcomes are very different. In the first 
case, time-to-first-event analysis might be employed, whereas a 
rate-based analysis (allowing multiple events to be counted) might 
be used in the second scenario. In the introduction to the article, 
the following statement is made, which helps considerably to clar¬ 
ify the primary intent of the trial: “Two cohorts of patients were 
studied in order to answer two different, but related, questions: 
Will intensive therapy prevent the development of diabetic reti¬ 
nopathy in patients with no retinopathy (primary prevention), and 
will intensive therapy affect the progression of early retinopathy 
(secondary intervention)? Although retinopathy was the principal 
study outcome, we also studied renal, neurologic, cardiovascular, 
and neuropsychological outcomes and the adverse effects of the 
two treatment regimens.” It is clear, then, that two randomized 
trials were performed in parallel, and retinopathy was the primary 
study outcome in both. The statement also makes it apparent that 
the study will most likely use time-to-first-event analysis as the 
main analytical tool. 

In the empirical sciences, hypotheses can never be proven. 
When reading a trial report, or when deciding an analysis plan, it is 
often worth spending a little time on a formal enumeration of the 
null and alternate hypotheses. For example, while the null and 
alternate hypotheses were not formally reported in the DCCT primary 
prevention trial study report, the statistical approach makes it 
clear that these were as follows: 

Null hypothesis: retinopathy with intensive treatment = retinopathy 
with standard treatment 

Alternate hypothesis: retinopathy with intensive treatment ≠ 
retinopathy with standard treatment 

Laid out in this fashion, an important analytical issue is immedi¬ 
ately addressed. In statistical terms, this is a two-tailed hypothesis. 
In other words, should intensive treatment truly worsen the primary 
outcome, this will become apparent in the primary analysis. The 
principal attraction of a one-tailed design is lower sample size 
requirements than those of a two-tailed equivalent. Had a one-tailed 
design been used in the DCCT trial, the null and alternate hypotheses 
might have been as follows: 

Null hypothesis: retinopathy with intensive treatment not better 
than retinopathy with standard treatment 

Alternate hypothesis: retinopathy with intensive treatment better 
than retinopathy with standard treatment 

If, in reality, the experimental treatment proves to be worse 
than the standard treatment, a one-tailed design will not lead to 
rejection of the null hypothesis. Ideally, the design and reporting 
of randomized trials should specify clearly whether one-tailed or 
two-tailed hypotheses are the basis of the trial. When confronting 
a one-tailed trial with neutral outcomes, one should immediately 
ask the question: even though A is not better than B, could B be 
better than A? 
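
The sample size advantage of a one-tailed design can be illustrated with a minimal sketch. It assumes the standard normal-approximation formula for comparing two means with equal allocation, which is a simplification chosen here for illustration and not the method used in the DCCT; the function name, effect size, and standard deviation are hypothetical.

```python
from statistics import NormalDist

def n_per_group_means(delta, sd, alpha=0.05, power=0.80, tails=2):
    """Approximate subjects per arm needed to detect a mean difference
    `delta` between two equal-sized groups with common SD `sd`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)  # critical value of the test
    z_beta = z.inv_cdf(power)               # quantile for the desired power
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Hypothetical inputs: a difference of half a standard deviation.
two_tailed = n_per_group_means(delta=0.5, sd=1.0, tails=2)
one_tailed = n_per_group_means(delta=0.5, sd=1.0, tails=1)
ratio = one_tailed / two_tailed

print(f"two-tailed: {two_tailed:.1f} per arm")
print(f"one-tailed: {one_tailed:.1f} per arm")
print(f"ratio: {ratio:.2f}")
```

Under these assumptions the one-tailed design needs roughly four-fifths of the subjects, a saving bought at the cost of being unable to conclude that the experimental treatment is worse.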

Another situation in which it may be useful to formally write 
down the null and alternate hypotheses is when an equivalence 
design is used. With the familiar standard comparative designs typi¬ 
cally used in double blind, placebo-controlled trials, the null 
hypothesis is that no difference between treatments exists, whereas 
the alternate hypothesis is that a difference exists. In contrast, with 
equivalence designs, the null hypothesis is that at least a minimum 
predefined difference exists between treatments, whereas the alternate 
hypothesis is that any difference is smaller than this predefined margin. 
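
The contrast between the two designs may be easier to see when written formally. In the sketch below, the symbols are assumptions introduced for illustration only: $\mu_A$ and $\mu_B$ denote the mean outcome under each treatment and $\Delta$ the predefined minimum difference, with "no difference" in an equivalence design read as "a difference smaller than $\Delta$."

```latex
\begin{aligned}
\text{Standard comparative design:}\quad
  & H_0:\ \mu_A - \mu_B = 0,
  & H_1:\ \mu_A - \mu_B \neq 0 \\
\text{Equivalence design:}\quad
  & H_0:\ |\mu_A - \mu_B| \geq \Delta,
  & H_1:\ |\mu_A - \mu_B| < \Delta
\end{aligned}
```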


3 What Are the Characteristics of the Trial Subjects? 

It seems intuitively obvious that one should know the maximum 
amount possible about the subjects included in a given random¬ 
ized trial. It is difficult to generalize study findings to other popu¬ 
lations and to individual patients without detailed descriptions of 
the study subjects. Secondly, as discussed below, randomization 
of therapies is rarely perfect with regard to characteristics that 
increase the chances of developing the study outcome during the 
trial, quite apart from treatments assigned during the randomiza¬ 
tion process. 

Key pieces of information that are rarely quoted in trial reports 
are the proportion of potentially eligible subjects available at the 
study sites, the proportion approached and the proportion of sub¬ 
jects that ultimately enter a trial. Clearly, while adding complexity 
to the logistics of the trial, all efforts to deliver this information 
should be made in the planning phase of the trial, as retrospective 
efforts to obtain these data are often fruitless. 

With perfect randomization, known and unknown characteristics 
are identical in all treatment arms. In practice, even in large 
trials with careful randomization procedures, the likelihood that 
one baseline characteristic will be statistically different increases 
with the number of reported characteristics. It follows that the 
fewer the number of reported characteristics, the lower the likeli¬ 
hood that any single characteristic will differ between the treat¬ 
ment arms. As discussed above, a strategy of reporting fewer study 
subject characteristics lowers the generalizability of the study. 
If one accepts that knowing as much as possible about the study 
population is inherently better than not, one must accept the pos¬ 
sibility of unearthing statistically significant differences between 
the groups. This situation is not irredeemable, however, as it is easy 
to adjust for imbalances in baseline characteristics. When apprais¬ 
ing a clinical trial, it is critical to inspect whether differences 
between treatment arms are present. If so, outcome analysis should 
necessarily adjust for this imbalance. It is equally critical to assess 
whether important clinical descriptors have been omitted. 
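
A rough sense of how quickly chance imbalances accumulate can be had from the following sketch. It assumes that each baseline characteristic is tested at a threshold of 0.05 and that the characteristics are independent of one another, which is rarely true in practice, so the figures are indicative only; the function name is hypothetical.

```python
def prob_any_significant(k, alpha=0.05):
    """Chance that at least one of k independent baseline comparisons
    has P < alpha when randomization has worked perfectly."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20, 40):
    print(f"{k:>2} characteristics: {prob_any_significant(k):.2f}")
```

With 20 reported characteristics, at least one nominally significant difference is more likely than not under these assumptions.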


4 Intention-to-Treat Analysis 

Though “intention to treat” is perhaps an unfortunate term, it is used 
extensively in the randomized trials literature. Essentially, the philosophy 
behind the term is that outcome analysis will be based entirely on 
random treatment assignment. As the latter is completely deter¬ 
mined by chance, “analysis by assigned treatment” might be a bet¬ 
ter description. Essentially, a black box is placed around all 
information accruing between randomization and assessment of 
the primary study outcome. One of the main advantages of this 
approach is that unplanned occurrences (such as crossovers between 
treatments, noncompliance, and co-interventions) make it less likely 
that differences between treatments will be seen, so that the 
intention-to-treat philosophy is conservative in advancing the case of 
new therapies. In randomized trials designed to determine whether 
treatments lead to differences in clinical outcomes, 
intention-to-treat analysis is the gold standard. All other approaches, 
including analysis restricted to those who remain throughout on 
assigned therapy, should be viewed as subsidiary and inadequate in 
isolation. 


5 Interim Analyses 


Even in large clinical trials, sample size and event rate projections are 
often based on guesswork. In addition, new interventions can have 
unexpected, and sometimes life-threatening, side effects. Equally 
well, the intervention may lead to more dramatic improvements in 
the primary outcome than originally expected. It is possible, then, that 
a difference between treatments could be present much earlier than 
originally planned in a trial. As a result, interim analyses of the primary 
outcome are usually planned in latter-day large clinical trials. 

While the conceptual basis for the stopping rules used in 
interim analyses (mainly the problem of dealing with the increasing 
probability of a false-positive result with increasing numbers of 
primary outcome analyses) is relatively straightforward, confusion 
still arises, perhaps because of the unfamiliarity of the terminology 
used. For example, the terms “group sequential methods” and 
“alpha spending functions” are not intuitively helpful to non-stat¬ 
isticians. The first of these terms, “group sequential methods,” is 
used to indicate that the group of patients, and/or endpoints stud¬ 
ied at each interim analysis, is likely to have changed, either because 
new patients have entered the study since the last interim analysis 
or because new events have accrued. Alpha is the probability level 
below which a difference between treatments will be accepted as 
statistically significant, and 
this is usually set at 0.05. Alpha spending means that, while the 
alpha level used at each analysis can vary, the overall planned alpha 
level (or type I error rate) remains constant, typically at 0.05 in 
most trials. 

Several alpha spending methods exist, sharing the common 
properties that the number of planned interim analyses needs to be 
specified in advance and that analyses are roughly equally spaced. 
The Pocock method uses the same critical value for all analyses, 
including the final analysis [2]. For example, in the absence of 
interim analysis, a two-sided test would generally have a critical 
value of 1.96 for an alpha level of 0.05. With the Pocock method, 
critical values might be 2.178 with two analyses in total (one interim 
and one final) and 2.413 with five analyses in total. Thus, using the 
Pocock method means that the P-value needed to reject the null 
hypothesis at the final analysis is considerably lower with greater 
numbers of interim analyses. The Haybittle-Peto approach uses a very 
conservative critical value for all interim analyses and then a value 
close to the usual value of 1.96 at the final analysis [3]. The critical 
boundary values with the O’Brien-Fleming approach are designed in such 
a way that the probability of stopping early increases with the amount 
of information available. For example, with one planned interim 
analysis, the critical values would be 2.782 at the interim analysis and 
1.967 at the final analysis. With five analyses in total, the critical 
values would vary between 4.555 at the first interim analysis and 2.037 
at the final analysis [4]. 
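
The reason these adjusted critical values are needed can be checked with a small simulation. The sketch below is illustrative only: it assumes five equally spaced analyses in total (the setting in which 2.413 and 2.037 are the tabulated constants), simulates trials in which the null hypothesis is true, and counts how often the test statistic crosses a boundary at any analysis. The function and variable names are hypothetical, and the O’Brien-Fleming boundaries are generated from the final value using the usual square-root scaling.

```python
import math
import random

def crossing_probability(bounds, n_sim=20000, seed=1):
    """Estimate, under the null hypothesis, the chance that |Z| reaches
    its boundary at any of len(bounds) equally spaced analyses."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_sim):
        total = 0.0  # running sum of independent standard-normal increments
        for k, bound in enumerate(bounds, start=1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(k)  # test statistic at analysis k
            if abs(z) >= bound:
                crossed += 1
                break  # the trial stops at the first crossing
    return crossed / n_sim

K = 5  # total number of analyses (four interim plus one final)
boundaries = {
    "unadjusted 1.96": [1.96] * K,
    "Pocock": [2.413] * K,
    "O'Brien-Fleming": [2.037 * math.sqrt(K / k) for k in range(1, K + 1)],
}
results = {name: crossing_probability(b) for name, b in boundaries.items()}

for name, prob in results.items():
    print(f"{name}: overall false-positive rate about {prob:.3f}")
```

Testing repeatedly at 1.96 inflates the overall false-positive rate well beyond 0.05, whereas both adjusted schemes hold it close to the planned level; they differ only in how the alpha is distributed across the analyses.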

Figure 1 is a schematic example of using the O’Brien and 
Fleming method in a trial planning four interim and one final anal¬ 
ysis. Notable features of this scheme include the fact that the test 
statistic value and associated P-value for the final comparison are 
broadly similar to those seen when no interim analysis is planned. 
In contrast, the absolute magnitude of the test statistic is larger and 
the boundary P-value smaller with ever earlier interim analyses. 
It is also notable that, while the trial can be stopped for futility, this 
becomes apparent in later phases of the trial. 

Fig. 1 Hypothetical example of O’Brien and Fleming boundaries in a trial with four planned interim analyses 
and one final analysis (vertical axis: test statistic) 

None of these stopping rule designs allows for the possibility 
that an early disadvantage with a given treatment may be more 
than counterbalanced in the long term, leading to net benefit. For 
example, in the secondary prevention arm of the DCCT trial, the 
intensive treatment group had a higher cumulative incidence of 
retinopathy in the first 2 years of the trial. By 9 years, however, this 
initial disadvantage was more than counterbalanced, so that the net 
effect was a 54 % reduction in the incidence of retinopathy in the 
intensive treatment group, compared to conventional treatment 
[1]. It is also worth pointing out that these designs are typically 
predicated on the primary outcome value only. Typically, major 
unexpected side effects are not incorporated into stopping rules. 
Finally, these methods are based entirely on the probability that a 
difference exists between treatments and do not take the size of 
this difference into account. With large sample sizes, small absolute 
differences can lead to trial termination. Early termination of trials 
can mean that cost-benefit and cost-utility cannot be assessed. In 
addition, treatment side effects that develop late, such as malig¬ 
nancy, may be difficult to identify. 

When reporting trials that were terminated earlier than 
planned, it should be clear whether the determination was based 
on preplanned stopping rules, or whether other factors were 
involved. One high-profile study illustrates the confusion that can 
result when these matters are not fully clear. In an open-label trial, 
1,432 patients with chronic kidney disease and anemia, embarking 
on erythropoietin therapy, were randomly assigned hemoglobin 
targets of 113 or 135 g/L, with a primary end-point of time to 
first occurrence of death or a major cardiovascular event [5]. Four 
interim analyses were planned using the O’Brien-Fleming alpha-spending 
boundary method. In the Results section the following 
statement is made: “The data and safety monitoring board recom¬ 
mended that the study be terminated in May 2005 at the time of 
the second interim analysis, even though neither the efficacy nor the 
futility boundaries had been crossed, because the conditional power 
for demonstrating a benefit for the high-hemoglobin group by the 
scheduled end of the study was less than 5 % for all plausible values 
of the true effect for the remaining data.” The first part of this state¬ 
ment makes it clear that neither efficacy nor futility boundaries were 
crossed. In other words, the primary outcome alone cannot have 
been used as the basis for stopping the trial. The word “because” is 
therefore inappropriate and the remaining description does not 
help us to understand why the trial was stopped, as there is no logi¬ 
cal connection with the previous information. Unfortunately, when 
read in isolation, the latter part of the statement sounds very much 
like the stopping rules were the basis for termination. The situation 
becomes even more unclear when one learns that the final analysis 
showed statistically different rates of the primary outcome accord¬ 
ing to random treatment assignment [5]. 


6 Mortality Comparisons, Even When Not a Primary Outcome 

Most trials examining discrete clinical outcomes use time to event 
analysis. Typically, every subject in the trial is followed until either 
the primary outcome or a censoring event occurs. In other words, 
every subject contributes two variables, time in the study and mode 
of exit (either with a primary outcome or censored without a pri¬ 
mary outcome). Censoring in clinical trials should be examined 
carefully. To begin with, the term “censoring” is not very intuitive. 
In practice, this refers to end-of-follow-up events other than the 
primary study outcome. For example, if primary outcomes are 
coded 1 and end-of-follow-up events 0, reaching the last day of the 
study without any clinical event occurring, leaving the study 
because of a major side effect of the study treatment and loss to 
follow-up are all treated identically. Similarly, death is a censoring 
event if it is not included in the primary outcome and it is conceiv¬ 
able that an intervention could improve the primary outcome, 
while shortening survival. For example, imagine a trial in which 
every patient had an identical duration of follow-up. For every hundred 
patients in the experimental group, 25 exit the study with a 
primary outcome, 50 die, and 25 exit without a clinical event; the 
corresponding values in the control group are 50, 25, and 25, 
respectively. Simple calculation shows that the intervention halves 
primary outcome rates and doubles death rates. Thus, attention 
must be paid to the mode of study exit and formal comparison of 
death rates should be included in studies of clinical outcomes. 
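
The arithmetic of this hypothetical trial can be laid out explicitly. The sketch below uses only the counts given in the example; because follow-up is identical for every patient, ratios of proportions equal ratios of rates. The variable and function names are introduced here for illustration.

```python
# Exits per 100 patients in the hypothetical trial described in the text.
experimental = {"primary_outcome": 25, "death": 50, "no_event": 25}
control = {"primary_outcome": 50, "death": 25, "no_event": 25}

def risk_ratio(event):
    """Proportion with `event` in the experimental arm divided by the
    corresponding proportion in the control arm."""
    risk_exp = experimental[event] / sum(experimental.values())
    risk_ctl = control[event] / sum(control.values())
    return risk_exp / risk_ctl

primary_rr = risk_ratio("primary_outcome")
death_rr = risk_ratio("death")
# Counting death together with the primary outcome removes the apparent benefit.
composite = (
    experimental["primary_outcome"] + experimental["death"],
    control["primary_outcome"] + control["death"],
)

print("risk ratio, primary outcome:", primary_rr)
print("risk ratio, death:", death_rr)
print("primary outcome or death per 100 (experimental, control):", composite)
```

An analysis that treated the deaths simply as censored observations would report only the first of these three results.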

Large disparities between the effect of a study treatment on the 
primary outcome and the effect on mortality can point to unex¬ 
pected side effects of a study treatment. For example, in the 
Helsinki Heart Study, 4,081 asymptomatic middle-aged men with 
dyslipidemia were randomly assigned to gemfibrozil or placebo 
and cardiac events were the primary outcome. As hypothesized, 
gemfibrozil reduced the incidence of cardiac events. However, 
death rates were unaffected by the intervention, which was surpris¬ 
ing, as one might have expected that a lower incidence of cardiac 
events would lead to a lower mortality rate. Ultimately, it was 
found that gemfibrozil-treated patients had higher rates of violent 
deaths, including suicide [6]. Thus, it could be argued that a 
reduction in total mortality is the only dependable evidence that an 
intervention affects clinical outcomes that should be expected to 
shorten survival. In practice, this level of evidence usually means 
very large sample sizes and prolonged follow-up. While this per¬ 
spective may seem somewhat extreme, failure to consider differen¬ 
tial mortality effects can lead to erroneous conclusions about the 
overall benefit of an intervention. 


7 Composite Outcomes 

With composite outcomes, it is important to assess whether the 
components included in the composite are biologically plausible. It 
is also important to question whether other potentially relevant 
clinical events have been excluded. While it seems intuitively 
obvious that each component of the composite outcome should be 
analyzed in isolation, many study reports fail to do this. 

One study systematically reviewed the use of composite end¬ 
points in clinical trials published in major medical journals between 
1997 and 2001. Ultimately 167 original reports were reviewed, 
involving 300,276 patients where the composite primary outcome 
included all-cause mortality. 38 % of the trials were neutral for both 
the primary end point and the mortality component; 36 % reported 
statistically significant differences for the primary outcome mea¬ 
sure, but not for the mortality component; 4 % showed differences 
in total mortality, but not for the primary composite outcome and 
finally, 11 % showed differences both in total mortality and in the 
composite primary outcome [7]. Related to this, while the effect 
on composite outcomes can be neutral, the intervention may 
reduce one component of the composite outcome. In this scenario, 
it must necessarily be the case that the intervention increases the risk of 
at least one of the other components of the composite outcome. 


8 Trials with Open-Label Run-In Periods on Active Therapy 

Trials with open-label run-in periods require careful analysis and 
overall conclusions should never forget the run-in phase, even if 
this was followed by a well-performed placebo-controlled compar¬ 
ison phase. For example, one study examined the effect of carvedilol 
in 1,094 heart failure patients and found that carvedilol reduced 
death and hospitalization rates during the placebo-controlled portion 
of the trial [8]. In the initial open-label run-in phase, eligible 
patients received 6.25 mg of carvedilol twice daily for 2 weeks and 
patients tolerating carvedilol were then assigned to receive 
carvedilol or placebo. The abstract, which is probably the section 
of a publication with the most relevance to clinicians, failed to 
report that the seven deaths that occurred during the run-in period did 
not appear in the mortality comparisons, even though they 
accounted for 24 % of all the deaths in patients who received 
carvedilol. In addition, 1.4 % of patients were withdrawn prior to 
randomization because heart failure worsened during the run-in 
phase. It is probably best, therefore, to avoid this design in general, 
because it is not reflective of real clinical decision-making, given 
that the philosophy is predicated on the premise that early effects 
can be discounted in the assessment of overall benefit. This said, 
when analyzing data from such a trial, a conservative effect esti¬ 
mate can be generated by assuming that all patients who do not 
enter the placebo phase are considered to have exited the study 
because the primary outcome has occurred. 


9 Factorial, Stratified, and Crossover Designs 

9.1 Factorial Design 

It is often advocated that large clinical trials should employ factorial 
designs, because several interventions can be assessed at one time. 
For example, with a typical 2 × 2 factorial design, subjects are 
randomized to none, either, or both of the treatments A and B. It 
is possible that the effect of treatment A depends on the presence 
of treatment B, an example of an interactive effect. Failure to 
account for this interaction can lead to biased estimates of the 
effects of treatments A and B. In other words, the primary analysis 
should jointly include A, B, and A × B as explanatory variables. 
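
A small numerical sketch may help to show what an interaction looks like in a 2 × 2 factorial trial. The event risks below are entirely hypothetical, equal numbers of subjects are assumed in each of the four cells, and effects are expressed as simple risk differences rather than as terms in a fitted regression model.

```python
# Hypothetical event risks in the four cells of a 2 x 2 factorial trial.
risk = {
    ("no A", "no B"): 0.20,
    ("A", "no B"): 0.10,
    ("no A", "B"): 0.15,
    ("A", "B"): 0.14,
}

# Effect of A (risk difference) estimated separately at each level of B.
effect_a_without_b = risk[("A", "no B")] - risk[("no A", "no B")]
effect_a_with_b = risk[("A", "B")] - risk[("no A", "B")]

# The interaction is the amount by which the effect of A changes with B.
interaction = effect_a_with_b - effect_a_without_b

# Ignoring B (equal cell sizes assumed) averages two different effects.
marginal_effect_a = (effect_a_without_b + effect_a_with_b) / 2

print(f"effect of A without B: {effect_a_without_b:+.3f}")
print(f"effect of A with B:    {effect_a_with_b:+.3f}")
print(f"interaction (A x B):   {interaction:+.3f}")
print(f"marginal effect of A:  {marginal_effect_a:+.3f}")
```

In this invented example, the marginal estimate describes neither the subjects who receive A alone nor those who receive A together with B, which is why the interaction term belongs in the primary analysis.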

9.2 Stratified Design 

A similar problem arises when patients are randomized within 
strata. Stratification is often used in an effort to ensure balance 
across treatment groups of a dominant characteristic of the study 
population, often one thought to be highly predictive of the occurrence 
of the primary outcome variable. It is possible that a study 
treatment could have differential effects in patients with and without 
this dominant characteristic. As with factorial designs, it is 
important to include a treatment-by-stratum term when the effect 
of the treatment is analyzed in the overall group. In other words, 
the primary analysis should jointly include treatment, stratum, and 
treatment × stratum as explanatory variables. In addition, even with 
the risks of multiple comparisons and the likelihood that individual 
strata may be inadequately powered, it is useful to analyze the 
effect of the intervention within the individual strata. 

9.3 Crossover Trials 

In spite of their conceptual simplicity, crossover trials are often 
analyzed inappropriately in the medical literature. In crossover trials, 
enrolled subjects are given sequences of treatments and differences 
in the primary outcome between individual treatments are compared 
[9]. With this design, the intervention is randomization to a 
sequence of treatments, for example AB or BA. Because subjects 
act as their own controls, between-subject variation is eliminated 
and smaller sample sizes are required to detect a given difference 
between treatments, in comparison with standard parallel group 
designs. The principal problem with crossover designs is 
period-by-treatment interaction, commonly referred to as the carryover 
effect. Stated briefly, carryover is the persistence of a treatment 
effect from a single period into a subsequent period of treatment. 
The possibility of carryover is the reason washout periods are 
commonly used in crossover trials. 

When analyzing crossover trials it is important to pretest the 
data for carryover effects [10]. In other words, the primary analysis 
should jointly include treatment, period, and treatment × period as 
explanatory variables. Analysis becomes highly problematic if 
carryover effects are detected. One approach to dealing with this 
problem is to treat the study as a standard parallel group trial, 
confining the analysis to one of the study periods (usually the first 
period is chosen). Obviously, the validity of this approach may be 
threatened by inadequate statistical power, as between-subject 
variability can no longer be assumed to be eliminated, and by the fact 
that the decision to use one of two periods is arbitrary [11]. Another 
approach, applicable when three or more treatment periods are used 
(such as ABB/BAA), is to model the carryover effect and to adjust 
treatment estimates for this effect [10]. 
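
For the simplest two-period AB/BA design, the quantities involved can be sketched as follows. The data are invented, a continuous outcome is assumed, and only point estimates are computed; a formal analysis would attach standard errors and tests to both contrasts. The helper names are hypothetical.

```python
from statistics import mean

# Hypothetical outcomes, recorded as (period 1, period 2) for each subject.
sequence_ab = [(12.0, 9.0), (11.0, 9.5), (13.0, 10.0), (10.5, 8.5)]
sequence_ba = [(9.0, 11.5), (8.5, 11.0), (10.0, 12.5), (9.5, 12.5)]

def mean_difference(pairs):
    """Mean of (period 1 minus period 2) within one sequence group."""
    return mean(p1 - p2 for p1, p2 in pairs)

def mean_total(pairs):
    """Mean of (period 1 plus period 2) within one sequence group."""
    return mean(p1 + p2 for p1, p2 in pairs)

# Treatment effect (A minus B): half the between-sequence contrast of the
# within-subject differences, which cancels any common period effect.
treatment_effect = (mean_difference(sequence_ab) - mean_difference(sequence_ba)) / 2

# Carryover contrast: if the first treatment does not persist into the
# second period, subject totals should be similar in the two sequences.
carryover_contrast = mean_total(sequence_ab) - mean_total(sequence_ba)

print("estimated treatment effect (A minus B):", treatment_effect)
print("carryover contrast (AB totals minus BA totals):", carryover_contrast)
```

Note that the carryover contrast compares different subjects, so it carries between-subject variability and is estimated far less precisely than the treatment effect.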


10 Number Needed to Treat 

The number needed to treat is defined as the number of cases that have 
to be treated with 
an intervention to prevent a single occurrence of the primary study 
outcome [12]. It is easily computed as the inverse of the absolute 
risk reduction caused by the treatment and can be very useful in 
economic analyses, in particular. In analyses based on survival tech¬ 
niques, quoting a number in isolation is meaningless, and an appro¬ 
priate mode of description might include a measure of the average 
or median duration of the trial. Similarly, annualizing this measure 
(as in saying X patients need to be treated per year) may not be 
accurate, as this approach implicitly assumes that the relative effect 
of the intervention does not vary with time. Finally, numbers 
needed to treat are often quoted without associated confidence 
intervals, which is meaningless from a statistical perspective. 
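
As a minimal sketch of the calculation, the number needed to treat and an approximate confidence interval can be obtained from simple event counts. The counts below are hypothetical, the interval is assumed to be obtained by inverting a Wald interval for the absolute risk reduction (one of several possible methods), and the function name is introduced here for illustration.

```python
import math

def nnt_with_ci(events_control, n_control, events_treated, n_treated, z=1.96):
    """Number needed to treat with an approximate 95% confidence interval,
    obtained by inverting the Wald interval for the absolute risk reduction."""
    p_c = events_control / n_control
    p_t = events_treated / n_treated
    arr = p_c - p_t  # absolute risk reduction
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    low, high = arr - z * se, arr + z * se
    nnt = math.inf if arr == 0 else 1 / arr
    if low <= 0:
        # The interval for the risk difference includes zero, so the
        # NNT interval is not a simple finite range.
        return nnt, None
    return nnt, (1 / high, 1 / low)

# Hypothetical trial: 200/1000 events with control, 150/1000 with treatment.
nnt, ci = nnt_with_ci(200, 1000, 150, 1000)
print(f"NNT = {nnt:.0f}, approximate 95% CI {ci[0]:.0f} to {ci[1]:.0f}")
```

Even in this fairly large hypothetical trial the interval is wide, which illustrates why a number needed to treat quoted alone conveys little.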


11 Neutral Trials: Power Issues 

When trials show no treatment effect, it is important to revisit the 
issue of sample size and Type II error. In particular, it is worth 
comparing the projected event rates with the actual event rates 
observed in the study, especially in the control group. It is also use¬ 
ful to produce numerical estimates of the potential treatment effect 
the study could have detected with standard power assumptions. 
As a corollary, it is also very useful to recalculate the sample size 
needed to detect the originally planned difference between treat¬ 
ments, using the primary outcomes seen in the study. 
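
The effect of a lower-than-projected control event rate on the required sample size can be sketched as follows. The example assumes a two-sided comparison of two proportions using the normal approximation with equal allocation; the event rates and the planned relative risk reduction are hypothetical, as is the function name.

```python
from statistics import NormalDist

def n_per_group_proportions(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided comparison of two
    proportions (normal approximation, equal allocation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2

relative_reduction = 0.25  # hypothetical planned treatment effect

# Projected control event rate of 20% versus an observed rate of 12%.
planned = n_per_group_proportions(0.20, 0.20 * (1 - relative_reduction))
recalculated = n_per_group_proportions(0.12, 0.12 * (1 - relative_reduction))

print(f"planned sample size per arm:      {planned:.0f}")
print(f"recalculated sample size per arm: {recalculated:.0f}")
```

In this invented scenario a trial sized on the projected rate would have enrolled little more than half of the subjects actually needed, so a neutral result would say little about the planned treatment effect.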


12 Imbalanced Randomization of Baseline Characteristics 
and Treatment Comparisons 

Earlier in this chapter, the case is made that full and open disclo¬ 
sure of as many characteristics of the study population as possible 
is vital in assessing the applicability of the study results in other 
populations and in individual patients. With such an approach, it is 
to be expected that some baseline characteristics may not be evenly 
balanced between treatment groups, even with meticulous ran¬ 
domization procedures. If it turns out that these imbalanced char¬ 
acteristics are themselves associated with the primary study 
outcome, it is imperative that analyses are performed in which 
adjustment is made for these imbalances. 

Because no trials are perfect, it is hard to argue against a policy 
in which treatment estimates are subjected to a rigorous sensitivity 
analysis, with regard to adjustment for many potential covariates. 

For example, in the CHOIR study, described above, despite 
the large sample size, statistically significant differences in two 
baseline characteristics were observed, namely, more hypertension 
and more prior coronary artery bypass surgery in the higher hemo¬ 
globin target group [5]. Bearing in mind that cardiovascular events 
were the primary outcome, it is reasonable to ask “What happens 
to the treatment effects, when adjustment is made for baseline 
covariates?” This analysis did not appear in the primary publication. 
Interestingly, when reported in another forum, with adjustment 
for baseline characteristics, the P-value for the randomly 
assigned intervention changed from 0.03 to 0.11 [13]. With a true 
treatment effect, statistical significance cannot be made to disappear 
with any adjustment strategy. In essence, therefore, the disparity 
between the adjusted analysis and the unadjusted analysis is 
an unequivocal demonstration that the null hypothesis for a true 
treatment effect cannot be safely rejected. 


13 Analysis of Randomized Trials from an Observational Perspective: 
Assessment of Hypothesized Risk Factors and Surrogates 

In some situations, it can be very helpful to combine control 
groups and intervention groups, and study the outcome associa¬ 
tions of the assigned therapy in an observational manner, much 
like one would do with a prospective cohort study. One ideal situ¬ 
ation for this approach is studies where subjects are assigned to 
target levels of biological variables. For example, in the Hypertension 
Optimal Treatment (HOT) trial, 18,790 hypertensive patients were 
randomly assigned to target diastolic blood pressures <90, <85, 
or <80 mmHg, and felodipine was used as primary antihypertensive 
therapy. In terms of the treatment experiment, the intervention 
had no effect on cardiovascular event rates. In contrast, when 
analyzed in a purely observational manner, patients with lower 
blood pressure levels during the course of the study had lower 
cardiovascular event rates [14]. The authors of the study summarized the 
findings as showing “the benefits of lowering the diastolic blood 
pressure down to 82.6 mmHg.” While they clearly demonstrated 
that people with blood pressure levels above this level had higher 
cardiovascular event rates, they also showed that an unknown common 
factor was leading both to high blood pressure and high cardiovascular 
event rates. In other words, they convincingly showed that 
the association between high diastolic blood pressure and 
cardiovascular disease in the study population was noncausal. Disparities 
between assigned and observed longitudinal variables in randomized 
trials strongly suggest the presence of an unknown factor causing 
both clinical observations and help to prove that an epidemiological 
association is noncausal. This phenomenon might be termed 
observational-experimental discrepancy. 

Randomized trials are the ideal arena for assessing the validity 
of surrogate markers. For example, imagine the following causal 
pathway in which A leads to C via the development of B: 

A → B → C 

If this causal pathway is truly valid, the following effects should 
occur with an intervention that lowers A: 

1. The intervention lowers B. 

2. The intervention lowers C. 

3. When adjustment is made for the development of B, the inter¬ 
vention no longer lowers C. 
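
The three expectations listed above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the linear chain, the unit effect sizes, the sample size, and the function names are hypothetical choices made for illustration and are not taken from any trial.

```python
import random


def ols_two_predictors(y, x1, x2):
    """Least-squares slopes of y on x1 and x2 (intercept included)."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def mean_difference(values, treat):
    """Mean in the treated group minus mean in the control group."""
    on = [v for v, t in zip(values, treat) if t == 1]
    off = [v for v, t in zip(values, treat) if t == 0]
    return sum(on) / len(on) - sum(off) / len(off)


# Assumed causal chain A -> B -> C; the intervention lowers A by one unit.
rng = random.Random(1)
n = 20000
treat = [rng.randint(0, 1) for _ in range(n)]
a = [rng.gauss(0, 1) - t for t in treat]
b = [x + rng.gauss(0, 1) for x in a]
c = [x + rng.gauss(0, 1) for x in b]

effect_on_b = mean_difference(b, treat)   # expectation 1: B is lowered
effect_on_c = mean_difference(c, treat)   # expectation 2: C is lowered
# Expectation 3: after adjustment for B, the treatment effect on C vanishes.
adjusted_effect, slope_b = ols_two_predictors(c, treat, b)

print(round(effect_on_b, 2), round(effect_on_c, 2), round(adjusted_effect, 2))
```

If B were not on the causal pathway (for example, if the intervention acted on C through another route), the adjusted effect would remain clearly different from zero.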

14 Hypothesis-Generating Analyses

Definition of study subjects, follow-up and assessment of clinical 
events are often highly detailed in good randomized trials. 
In addition, high-quality data are often available about outcomes 
other than the primary study question, and biological samples may 
be collected sequentially at regular intervals. It may be possible, 
then, to study many other outcomes with regard to randomly 
assigned treatment groups. It may be possible to examine the indi¬ 
vidual components of a composite outcome and even to reexamine 
the primary outcome many years after the experiment has stopped. 
Examination of outcomes in segments of time should be possible. 
In essence, the list of possible comparisons may be impressively 
long. All these analyses are subject to the problem of multiple com¬ 
parisons, such that it is a virtual guarantee that an apparently 
important difference between assigned treatment groups will be 
found, if one performs enough analyses. While there is nothing 
intrinsically wrong with examining multiple outcomes in high- 
quality data sets, these analyses should always be considered as 
hypothesis generating, at best. Unfortunately, this limitation 
applies to all non-primary outcome analysis, whether or not these 
were enumerated in the planning phase of the study. 
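
The scale of the multiple comparisons problem can be made concrete with a short calculation. The sketch below assumes independent comparisons, each tested at a significance level of 0.05; both assumptions, and the function name, are illustrative and are not stated in the text above.

```python
def chance_of_false_positive(n_comparisons, alpha=0.05):
    """Probability of at least one nominally significant result
    when no true differences exist and comparisons are independent."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


for k in (1, 5, 14, 50, 100):
    print(k, round(chance_of_false_positive(k), 3))
```

Under these assumptions, the chance of at least one spurious finding passes one half at 14 comparisons and is close to certainty at 100.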

Chapter 11


Randomized Controlled Trials 3: Measurement 
and Analysis of Patient-Reported Outcomes 

Michelle M. Richardson, Megan E. Grobert, and Klemens B. Meyer 

Abstract 

The study of patient-reported outcomes, now common in clinical research, had its origins in social and 
scientific developments during the latter twentieth century. Patient-reported outcomes comprise func¬ 
tional and health status, health-related quality of life, and quality of life. The terms overlap and are used 
inconsistently, and these reports of experience should be distinguished from expressions of preference 
regarding health states. Regulatory standards from the USA and European Union provide some guidance 
regarding reporting of patient-reported outcomes. The determination that measurement of patient- 
reported outcomes is important depends in part on the balance between subjective and objective outcomes 
of the health problem under study. Instrument selection depends to a large extent on practical consider¬ 
ations. A number of instruments can be identified that are frequently used in particular clinical situations. 
The domain coverage of commonly used generic short forms varies substantially. Individualized measure¬ 
ment of quality of life is possible, but resource intensive. Focus groups are useful, not only for scale devel¬ 
opment but also to confirm the appropriateness of existing instruments. 

Under classical test theory, validity and reliability are the critical characteristics of tests. Under item 
response theory, validity remains central, but the focus moves from the reliability of scales to the relative 
levels of traits in individuals and items’ relative difficulty. Plans for clinical studies should include an explicit 
model of the relationship of patient-reported outcomes to other parameters, as well as definition of the 
magnitude of difference in patient-reported outcomes that will be considered important. It is particularly 
important to minimize missing patient-reported outcome data; to a limited extent, a variety of statistical 
techniques can mitigate the consequences of missing data. 

Key words Patient-reported outcomes, Health-related quality of life, Quality of life, Functional status, Health status


1 History and Definition of Patient-Reported Outcomes 

Over the past several decades, it has become common for clinical 
investigators to use forms and questionnaires to collect observers’ 
reports of human subjects’ function and experiences. These instru¬ 
ments measure perception and assessment. Implicitly or explicitly, 
the investigators hypothesize that such measurement may detect 
variation in the natural history of disease and treatment effects not 


Patrick S. Parfrey and Brendan J. Barrett (eds.), Clinical Epidemiology: Practice and Methods, Methods in Molecular Biology, 
vol. 1281, DOI 10.1007/978-1-4939-2428-8_11, © Springer Science+Business Media New York 2015 


described by vital status or by observations recorded in the clinical 
record. The first instruments designed to assess function were 
devised in the 1930s and 1940s. These reports of functional status 
by an observer were not qualitatively different from a detailed 
account, using a controlled, and hence countable vocabulary, of 
selected aspects of the medical history. The ensuing nine decades 
have seen three clinical developments in the measurement of func¬ 
tion and experience in clinical investigation: the change in reporting 
perspective from third person to first person; the broadening of 
the phenomena of interest from function to quality of life and the 
subsequent definition of a narrower focus on health-related quality 
of life; and the merging of the tradition of clinical observation with 
that of psychometric measurement, which had developed through 
educational and psychological testing. 

The earliest instruments examined aspects of function closely 
related to impaired physiology, whether related to heart failure and 
angina (the New York Heart Association classifications), to malig¬ 
nancy and the consequences of its treatment (Karnofsky perfor¬ 
mance status), or to functional limitations as measured by 
independence in activities of daily living for aged, chronically ill 
individuals (e.g., the PULSES Profiles, the ADL Index, the Barthel 
Index). Development of survey technology by the US military for 
screening purposes during World War II gave a precedent for direct¬ 
ing questions to the individual rather than to clinicians or other 
observers [1]. Between the 1950s and 1970s, several trends stimu¬ 
lated measurement of quality of life at the societal level. One was 
the advent of public financing of health care and a second the devel¬ 
opment of extraordinary, intrusive, and expensive methods of life 
prolongation. These posed new questions: Was such prolongation 
desirable at all costs and under all circumstances? (The first indexed 
reference to “quality of life” in MEDLINE raised questions about 
dialysis treatment for chronic kidney failure [2].) The third trend 
contributing to interest in quality of life was the rise of a techno¬ 
cratic approach to public policy, accompanied by statistical argu¬ 
ment: Quality of life was a “social indicator,” a complement to 
economic measurement. Interest in it reflected in part ideological 
rejection of what was seen as materialism and emphasis on quality 
rather than quantity [3]. Finally, there was an optimistic emphasis
on positive health rather than on the absence of disease, first 
expressed in 1946 in the Constitution of the World Health 
Organization [4]. 

In the USA in the 1970s and 1980s, publicly funded studies,
such as the Health Insurance Experiment and the Medical 
Outcomes Study, saw the development of health status measure¬ 
ment by scientists trained not primarily as clinicians but in the 
tradition of psychological and educational testing. They brought 
mathematical rigor and an emphasis on the theoretical under¬ 
pinnings of survey construction to measurement of health. 
Two traditions of assessment merged: the categorical medical
model, and the dimensional psychological and educational model
[5]. Beginning in the 1950s, developments in statistics, psychometrics,
and the technology of measurement have fundamentally
changed the understanding of patient-reported outcomes,
not merely accelerating the process of data acquisition and analysis,
but defining new concepts. Computerized adaptive testing uses item
response theory to adjust the items presented to a respondent on
the basis of previous responses. Evolving statistical methods allow
comparison of the relative validity of instruments and offer the
prospect of being able to “crosswalk” results from one instrument
to another, allowing comparison of populations across studies.

The use of questionnaires to elicit reports about all aspects of 
experience, and the theoretical need to connect the tradition of nar¬ 
rowly focused biological inquiry with broader social concerns led to 
the formulation of a model in which the individual’s physical and 
mental functioning were at the epicenter of a family of concentric 
circles; physical aspects of health were measurable as functional sta¬ 
tus and physical and mental aspects as health status. Health status was 
also used to describe the union of these core issues with issues that 
involved both the individual and his or her immediate social setting: 
the influence of emotions and physical health on the individual’s role 
functioning, social functioning, and spirituality. The term health- 
related quality of life (HRQOL) is often used to describe aspects of 
quality of life directly related to health and to distinguish these from 
aspects of experience more indirectly related to the individual and 
more dependent on social and political trends. However, the term 
quality of life remains ambiguous shorthand: it refers in some 
contexts to health-related aspects of experience but in others to
valuation of those experiences. For example, one university center 
studying it defines quality of life as “The degree to which a person 
enjoys the important possibilities of his or her life” [6]. 


2 Experience and Preference 

Patient-reported outcomes are descriptions. These descriptions are 
measured by instruments that elicit the individual’s observation of 
his or her experience, sometimes experience in the world, some¬ 
times internal experience, sometimes anticipation of future experi¬ 
ence. The instruments used to elicit these quantitative descriptions 
represent a measurement tradition that can be traced back to 
Fechner’s nineteenth century psychophysics. This psychometrically 
derived description of the individual’s experience should be distin¬ 
guished from the individual’s relative valuation of outcomes or 
states of health. This valuation was classically described as the 
utility of an outcome to the individual; measures of this valuation 
are commonly described as preference-based. It is defined by the 
standard reference gamble described by von Neumann and
Morgenstern, in which one chooses between the certainty of an
intermediate outcome, on one hand, and a gamble between the best
and worst possible outcomes, on the other. The utility of the
intermediate outcome is defined as being equal to the probability of the
best possible outcome at the point of indifference, the point at
which one prefers neither the gamble between best and worst
outcomes nor the certainty of the intermediate outcome. For example,
imagine that a chronically ill patient faces a choice between
continued chronic illness and a treatment that will either return
him to perfect health or result in his immediate death. He finds
that at a probability of 0.7 of surviving the treatment and returning
to perfect health, he is indifferent to the choice. The utility of life
in his current state of chronic illness is 0.7. Because of the intuitive
difficulty of the standard reference gamble, other approaches to
assess the utility of health states have been explored, most
successfully the time trade-off, in which the utility of the intermediate
state of health is measured by the number of years of perfect health
one would accept in exchange for longer survival in impaired
health. Thus, if one is indifferent to a choice between 7 years of
perfect health and 10 years of chronic illness, the utility of the state
of chronic illness would again be 0.7.
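
The two worked examples above reduce to simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes the conventional anchoring of perfect health at a utility of 1 and death at 0.

```python
def standard_gamble_utility(p_best_at_indifference):
    """Utility of the intermediate state: the expected utility of the
    gamble at the point of indifference (best = 1, worst = 0)."""
    best, worst = 1.0, 0.0
    p = p_best_at_indifference
    return p * best + (1.0 - p) * worst


def time_tradeoff_utility(years_perfect_health, years_impaired_health):
    """Utility of the impaired state: years of perfect health accepted
    in exchange for a longer survival in impaired health."""
    return years_perfect_health / years_impaired_health


print(standard_gamble_utility(0.7))   # chronically ill patient in the text
print(time_tradeoff_utility(7, 10))   # 7 years versus 10 years
```

Both methods give the same utility in these examples only because the figures were chosen that way; in practice the two elicitation methods need not agree.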


3 Regulatory Standards 

Results from reliable patient-reported outcome instruments 
originating in appropriately designed studies can be used to support 
claims of therapeutic benefit in medical product labeling. Labeling 
claims usually relate to patients’ signs and symptoms or to an aspect 
of functioning affected by the disease state. In its Guidance to 
Industry, the US Food and Drug Administration (FDA) suggests a 
specific and detailed process by which a patient-reported outcome 
instrument may be used to prove labeling claims. Four general 
guidelines require definition of and set standards for [7]: 

1. The population enrolled in the clinical trial 

2. The clinical trial objectives and design 

3. The PRO instrument’s conceptual framework 

4. The PRO instrument’s measurement properties 

The European Medicines Agency also provides guidance for 
quality of life research. In the European Union, efficacy and safety 
are the basis for drug approval; proving an improvement in quality 
of life is optional. The European Medicines Agency defines PROs 
as self-administered, subjective, multidimensional measures and 
specifies that domains must be “clearly differentiated from the core 
symptoms of the disease,” such as pain. An application submitted
claiming global HRQOL improvement must be accompanied by
evidence showing robust improvement in all or most of the
domains measured. The instrument or questionnaire used must
specifically be validated in the condition in question. The observation
period must be sufficient to allow distinction of treatment
effect on HRQOL from underlying short-term fluctuation [8, 9].


4 Instrument Selection 

The most fundamental question is whether patient-reported outcomes
are important to a study. An affirmative answer to the question
of importance implies that one has a theory or model of the
outcomes in question and their relationship to the disease or
intervention. A paper from the European Regulatory Issues on Quality
of Life Assessment Group suggests that health-related quality of
life measurement may be helpful in the following scenarios [8]:

• When one or more HRQOL domain(s) is critical for patients,

• When there is no objective marker of disease activity (e.g., 
migraine, arthritis), 

• When a disease can only be characterized by several possible 
measures of clinical efficacy (e.g., asthma), 

• When a disease is expressed by many symptoms (e.g., irritable 
bowel syndrome), 

• When treatment extends life, possibly at the expense of 
well-being and HRQOL [...], 

• When the new treatment is expected to have a small or non¬ 
existent impact on survival [...] but a positive impact on 
HRQOL [...], 

• With highly efficient treatment in severe and handicapping 
diseases (e.g., rheumatoid arthritis) to ensure that improve¬ 
ment of severity score is accompanied by improvement of 
HRQOL, 

• With not very efficient treatment in less severe diseases (e.g., 
benign prostatic hypertrophy, urinary incontinence) to ensure 
that the modest improvement of symptoms is accompanied by 
improvement of HRQOL, 

• In diseases with no symptoms (e.g., hypertension) to ensure 
that treatment does not alter HRQOL, and 

• In equivalence trials, for drugs anticipated to result in a similar 
disease course, but expected to have HRQOL differences. 

If the primary or secondary outcomes of a study are patient- 
reported, it is important that the investigators make explicit such a 
conceptual model [10]. In evaluating a specific candidate
instrument, the following criteria have been suggested [1]:

1. Will it be used to evaluate a program or to study individuals? 

2. What diagnoses, age groups, and levels of disability will be 
studied? 

3. Are short-term or long-term conditions to be studied? 

4. How broad and how detailed must the assessment be? 

5. Does the instrument reflect an underlying conceptual approach 
and is that approach consistent with the study? 

6. Is scoring clear? 

7. Can the instrument detect the changes in question? 

8. What evidence exists as to reliability and validity? 

In addition to the above questions, the FDA reviews the 
following characteristics of instruments [7]: 

• Concepts being measured 

• Number of items 

• Conceptual framework of the instrument 

• Medical condition for intended use 

• Population for intended use 

• Data collection method 

• Administration mode 

• Response options 

• Recall period 

• Scoring 

• Weighting of items or domains 

• Format 

• Respondent burden 

• Translation or cultural adaptation availability 

In choosing among instruments, the first question is whether 
to use a measure of general health-related quality of life alone or to 
supplement the generic instrument with items regarding experiences 
particularly relevant to the particular health problem under study. 
(It would be theoretically possible to ask only questions specific to 
the instant problem, but it is hard to imagine a situation in which 
the information elicited by a general instrument would not be 
important, if only to compare the study population to other popu¬ 
lations.) The reason to supplement a generic “core” instrument is 
that one has reason to believe, on the basis of prior published 
patient-reported outcome data, clinical experience, or preliminary 
investigation, that the generic instruments do not elicit important 



Randomized Controlled Trials 3: Measurement and Analysis of Patient-Reported... 197 

Table 1 

Instruments commonly used in specific clinical settings 


Disease 

Measures 

Coronary heart disease 

Minnesota Living with Heart Failure Questionnaire (MLHFQ) 

Seattle Angina Questionnaire 

MacNew Heart Disease Health-Related Quality of Life Questionnaire 
Kansas City Cardiomyopathy Questionnaire (KCCQ) 

Geriatrics 

Geriatric Depression Scale 

Life Satisfaction Index 

Kidney failure 

Kidney Disease Quality of Life—Short Form (KDQOL-SF) 

Cancer 

EORTC QLQ-C30 


Breast Cancer Chemotherapy Questionnaire 

Orthopedics/rheumatology Oswestry Disability Index 

Fibromyalgia Impact Questionnaire 


HIV 

Neck Disability Index 

Arthritis Impact Measurement Scale (AIMS) 

MOS-HIV 

MQoL-HIV (Multidimensional Quality of Life Questionnaire for HIV 
infection) 

Diabetes 

Diabetes Quality of Life Measure 


experiences or adequately detect change. Table 1 shows examples of 
instruments used in situations in which clinical studies commonly 
measure health-related quality of life. The empiric literature on 
whether disease-specific supplements add information to generic 
instruments is mixed and seems to vary by clinical situation [11]. 
The content that a “disease-specific” supplement adds to a general 
measure of health-related quality of life may be quite specific but 
may also include domains that might be considered of general 
importance but are not included in all general measures. For exam¬ 
ple, the SWAL-QOL, which assesses the effects of dysphagia, 
includes items regarding both the burden of swallowing problems 
and sleep [12]. 

Table 2 shows the domains included in seven commonly used 
short general measures of health-related quality of life. One family of 
instruments deserves particular note, the series of short forms mea¬ 
suring health status developed in the course of the Medical Outcomes 
Study. The Short Form-36 Health Survey is probably the most widely 
used measure of health-related quality of life worldwide. Its transla¬ 
tion has been the centerpiece of the International Quality of Life 
Assessment Project (www.iqola.org). SF-36 domains include physical 
functioning, role limitations due to physical health, bodily pain, gen¬ 
eral health perceptions, vitality, social functioning, role limitations 
due to emotional problems, and mental health. It yields scale scores
for each of these eight health domains and two summary measures of
physical and mental health: the Physical Component Summary score
(PCS) and Mental Component Summary score (MCS). The SF-6D,
a derivative measure, allows assessment of utilities.

Table 2
Domains included in commonly used short general measures of health-related quality of life

Domain                NHP   COOP   DUKE   EQ-5D   WHOQOL-BREF   SF-36
Pain                  ✓     ✓      ✓      ✓       ✓             ✓
Physical functioning  ✓     ✓      ✓      ✓       ✓             ✓
Mental health         ✓     ✓      ✓      ✓       ✓             ✓
General health        -     ✓      ✓      ✓       ✓             ✓
Social functioning    ✓     ✓      ✓      -       ✓             ✓
Sleep                 ✓     -      -      -       ✓             -
Fatigue               ✓     -      -      -       ✓             ✓
Family                ✓     -      ✓      -       -             -
Work                  ✓     -      ✓      -       ✓             ✓

WHOQOL-BREF World Health Organization Quality of Life-BREF, SF-36 Short Form-36 Health Survey

In 1996, a second version of the SF-36 was produced, SF-36
v2. Although designed to be comparable to the original form,
SF-36 v2 rewords some questions and begins with specific
instructions. Several previously dichotomous questions were given more
response options to improve fidelity; to streamline the
questionnaire, some six-item responses were reduced to five. Formatting
was also standardized: all responses are now listed horizontally.
The intent was that the SF-36 v2 be more easily used throughout
the world. The SF-12, similarly revised in 1996, was developed
from the SF-36 as a scannable one-page survey. The SF-12 has 12
items rather than 36, takes about 2 rather than 7 min to complete,
and offers results comparable to the SF-36. Numerous studies
show PCS and MCS scores calculated from the two instruments to
be comparable. The SF-12 is an ideal tool for population-level
work. However, the greater reliability of the longer short form
makes the SF-36 preferable for interpretation of an individual’s results,
and some investigators argue that even it is too unreliable for this
purpose.

In part because of the large number of available surveys, the 
National Institutes of Health began an initiative called PROMIS® 
(Patient Reported Outcomes Measurement Information System). 
This system is intended to be a collection of highly reliable, precise 
measures of patient-reported health status for physical, mental, 
and social well-being. PROMIS® is a unique venture in that the
measures have been standardized so there are common domains
and metrics across conditions, allowing for comparisons across
domains and diseases; all metrics for each domain have been
rigorously reviewed and tested for reliability and validity; PROMIS®
items can be administered in a variety of ways and in different forms;
and PROMIS® encompasses all people, regardless of literacy,
language, physical function, or life course [13].


5 Individualized Measures of Patient-Reported Outcomes

The attempt to measure subjective experience, whether health status,
health-related quality of life, or broader aspects of experience,
arose at least in part as a reaction to the reduction of human
experience to economic and clinical measurements. However, some
investigators argue that standardized questionnaires are inherently
insensitive to the particular issues most important to individuals
and that the individual should be able to designate the domains
important to him or her. McDowell describes this debate as one between
nomothetic and idiographic approaches to knowledge [1]. The
Schedule for the Evaluation of Individual Quality of Life (SEIQoL)
is probably the most widely cited individualized instrument.
Because of the complexity and resource requirements of measuring
individualized quality of life, such instruments have not been
widely used to define clinical trial outcomes [14].


6 Defining the Issues: Focus Groups 

The complexity and resource requirements of new instrument
development make the use of existing instruments preferable
whenever possible. However, even if the intent is to use existing
instruments, it is important to confirm their content validity. Focus
groups are often recommended but rarely described. Professional
facilitators have experience in directing conversation flow and
analyzing qualitative results and will also assist in planning focus
groups [15-18]. Focus groups should be relatively homogeneous,
and participants should be strangers to one another; this
encourages disclosure and minimizes assumptions in conversation.
If sexuality or other sensitive information may be discussed, men 
and women should be separated. The optimal size is 7-10; too 
many participants leave some without a chance to talk, and too few 
create too much pressure to talk. Very few people should be in the 
room during the conduct of the group itself: only the facilitator, a 
note taker, and at most one or two members of the research team. 
Under no circumstances should authority figures, such as clinicians 
responsible for the care of group members, be present. To protect 
privacy, it is helpful to use pseudonyms and record only general 
demographic information. The uses to which transcripts will be 
put and their eventual disposition should be described explicitly in 
the process of obtaining informed consent. A focus group meeting 
should last for 1-2 h and should follow a five- to six-item agenda. 
Questions should be prepared in advance. Prompts such as posters 
or handouts may help stimulate discussion. A strong introduction 
is very important. The facilitator should take a few minutes to 
explain the purpose of the research and the goals of the group. 


7 Characteristics of Scales 

There are four kinds of scales: nominal, ordinal, interval, and ratio. 
A nominal scale simply assigns the subject to a category, for exam¬ 
ple, yes/no, male/female; the categories have no quantitative rela¬ 
tionship. An ordinal scale orders the categories but does not define 
the magnitude of the interval between categories. An interval scale 
defines the distance between categories. A ratio scale, by defining a 
zero point, allows comparison of the magnitude of categories. 
These distinctions have important implications, and individuals 
embarking on scale construction, rather than simply mimicking the 
categories of existing instruments, should consult more detailed 
discussions [1, 5]. (Clinicians may find it edifying to consider that 
the traditional medical history is phrased almost exclusively in 
nominal inquiries.)

A major drawback of traditional static surveys is the presence 
of floor and ceiling effects. These effects limit a scale’s ability to 
discriminate among individuals and to detect change. A scale can¬ 
not distinguish between two individuals who give the lowest pos¬ 
sible response to every item (floor effect) nor between two who 
give the highest possible response (ceiling effect). Neither can a 
scale detect further deterioration in an individual responding at the 
floor or improvement in an individual responding at the ceiling. 
The magnitude of floor and ceiling effects obviously depends both 
on the items’ difficulty and the respondents’ experience. 
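
A simple way to gauge these effects in a data set is to count the share of respondents scoring at the extremes of the scale. The sketch below is illustrative only: the function name, the 0-100 scoring range, and the sample scores are assumptions, not data from any study.

```python
def floor_ceiling_shares(scores, minimum, maximum):
    """Fraction of respondents at the lowest and highest possible score."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == minimum) / n
    at_ceiling = sum(1 for s in scores if s == maximum) / n
    return at_floor, at_ceiling


# Hypothetical scale scores on a 0-100 metric.
sample_scores = [0, 0, 10, 50, 75, 100, 100, 100]
floor_share, ceiling_share = floor_ceiling_shares(sample_scores, 0, 100)
print(floor_share, ceiling_share)
```

Respondents counted in either share cannot be distinguished from one another by the scale, and the scale cannot register further deterioration at the floor or further improvement at the ceiling.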

Recent advances in survey design have helped minimize the 
problem of floor and ceiling effects. Computerized adaptive testing 
(CAT) employs a simple form of artificial intelligence that selects 
questions tailored to the respondent, shortens or lengthens the
instrument to achieve the desired precision, scores everyone on a
standard metric so that results can be compared, and displays
results instantly [19]. Each administration of an instrument adapts
to the level of disease impact the respondent reports in any
particular domain or content area. This approach minimizes the number
of items required to estimate that impact. Typically, adaptive
software first asks a question in the middle of the impact range and
adjusts subsequent items on the basis of the response. With the
administration of each item, the score and confidence interval are
recalculated (Fig. 1). New items are administered in an iterative
fashion until the stopping rule is satisfied. By altering the stopping
rule, it becomes possible to match the level of score precision to
the specific purpose of measurement for each individual. For example,
more precision in scoring will be needed to monitor individual
progress than to identify presence of disease impact for an
individual respondent.

Fig. 1 Logic of computerized adaptive testing

CAT methodology increases the precision of score estimates, 
potentially allowing for reliable and valid clinical use in individual 
patients, eliminates floor and ceiling effects, provides confidence 
intervals specific to the individual, and allows monitoring of data 
quality in real time. Although the costs of development and imple¬ 
mentation may be considerable, the marginal cost of assessing 
respondents who are able to use the technology should be consid¬ 
erably less than that associated with other techniques of data col¬ 
lection. If the additional accuracy and precision available from 
adaptive methods are taken into account, use of the technology 
may be associated with a very favorable cost-effectiveness ratio. 
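
The adaptive logic described in this section (start mid-range, re-estimate after each item, stop when precision is adequate) can be sketched in a few lines. This is a minimal illustration and not the algorithm of any particular CAT system: it assumes a one-parameter (Rasch) item response model, a standard normal prior, a hypothetical calibrated item bank, and an arbitrary precision target, none of which come from the text.

```python
import math


def p_endorse(theta, difficulty):
    """Rasch (one-parameter logistic) probability of endorsing an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def run_cat(item_bank, answer, se_target=0.5, max_items=20):
    """Administer items until the stopping rule is satisfied.

    item_bank: dict of item id -> difficulty (hypothetical calibration)
    answer:    callable(item_id) -> True/False, the respondent's reply
    Returns (score, standard_error, administered item ids).
    """
    grid = [g / 10.0 for g in range(-40, 41)]         # trait levels -4..4
    weights = [math.exp(-0.5 * t * t) for t in grid]  # standard normal prior
    remaining = dict(item_bank)
    asked = []
    score, se = 0.0, float("inf")                     # begin mid-range
    while remaining and len(asked) < max_items and se > se_target:
        # Choose the unused item closest in difficulty to the current score.
        item = min(remaining, key=lambda i: abs(remaining[i] - score))
        difficulty = remaining.pop(item)
        endorsed = answer(item)
        asked.append(item)
        # Update the distribution over trait levels with the new response.
        for k, t in enumerate(grid):
            p = p_endorse(t, difficulty)
            weights[k] *= p if endorsed else (1.0 - p)
        total = sum(weights)
        score = sum(w * t for w, t in zip(weights, grid)) / total
        var = sum(w * (t - score) ** 2 for w, t in zip(weights, grid)) / total
        se = math.sqrt(var)
    return score, se, asked


# Hypothetical bank of 25 items with difficulties from -3 to 3, and a
# respondent who endorses every item easier than a trait level of 1.3.
bank = {f"q{k}": -3 + 0.25 * k for k in range(25)}
score, se, asked = run_cat(bank, lambda item: bank[item] < 1.3)
print(round(score, 2), round(se, 2), len(asked))
```

Tightening `se_target` lengthens the test, which corresponds to the point made above that monitoring an individual requires more precision than screening for the presence of disease impact. An approximate 95 % interval for the score can be formed as the score plus or minus 1.96 standard errors.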


8 Validity


Validity has been defined as the degree to which an item or an 
instrument measures the phenomenon of interest in the popula¬ 
tion of interest. Validity is sometimes described as the extent to 
which the item or instrument measures what it purports to mea¬ 
sure, rather than something else, and is traditionally said to com¬ 
prise content validity, criterion validity, and construct validity. 

Content or face validity describes a conceptual and generally 
qualitative and intuitive assessment: whether the items capture the 
experiences that are important and whether they do so accurately. 
Content validity is most commonly assessed by literature review and 
conversation, whether unstructured and informal individual conver¬ 
sation or using the techniques of focus groups. The individuals 
whose opinions are consulted to determine content validity may be 
professionals accustomed to working with the subjects whose expe¬ 
rience is to be measured or representative subjects themselves. 

Construct validity is said to be present if responses to a mea¬ 
sure exhibit the pattern that might be predicted on the basis of the 
investigator’s theoretical model of the phenomenon of interest. 
For example, in an opioid-naive population, responses to a pain 
scale correlate with the dose of narcotic required to treat postop¬ 
erative pain, and both the pain scale response and the narcotic dose 
decline day by day following uncomplicated surgery. The pain scale 
might be said to show construct validity in that population. On the 
other hand, among the clients of a methadone maintenance pro¬ 
gram, it might not be so easy to use narcotic doses on the days 
following surgery to show construct validity of the pain scale. 

Criterion validity is a special case of construct validity, examining
the correlation between the proposed measure and existing
measures, the validity of which has been established or presumed.
A scale measuring depression might be validated against the Beck
Depression Inventory or the clinical diagnosis of depression by a
psychiatrist. References 1-3 explore validity as a theoretical
construct in more detail, each from a slightly different perspective,
explaining different subtypes of validity that have been described
and the mathematical techniques for assessing validity.


9 Reliability 


Reliability, the second characteristic defining a scale’s performance, 
describes the consistency of its results. Consider a questionnaire 
asking women how old they are. If the same women respond to the 
questionnaire on three consecutive weeks, a reliable instrument 
will yield approximately the same answer every time (allowing for 
birthdays). If the subjects give approximately their real ages, the 
instrument is valid. Reliability does not require validity: an instrument 
may be highly reliable but also highly biased; for example, the women may all 
consistently deduct 10 years from their actual ages. Their responses 
are reliable, in that they do not vary, but not valid. Conversely, 
however, validity does depend on reliability. If each woman reports 
a different age on each administration, the instrument clearly is not a 
valid measurement of age. Formally, reliability is defined by the 
equation reliability = subject variability/(subject variability + mea¬ 
surement error). 

A scale’s reliability may be estimated in several ways: the two 
commonly reported in studies of patient-reported outcomes are 
test-retest reliability and internal consistency reliability. Test-retest 
reliability compares the results obtained by administering an instru¬ 
ment to the same subjects on two occasions. Of course, if the sub¬ 
jects’ experiences changed between the occasions, any difference 
observed combines the true change and the error attributed to lack 
of reliability. If the second occasion occurs too soon, the subjects’ 
responses on the second occasion may be influenced by memories 
of the first. Empirically, it appears that a reasonable interval is 
somewhere between 2 and 14 days [5]. Internal consistency reli¬ 
ability is the most common measure of reliability described in con¬ 
temporary reports of patient-reported outcomes. It represents the 
logical extension of split-half reliability, which in turn represents a 
form of equivalent-forms reliability. Equivalent-forms reliability 
compares the scores of identical subjects or similar groups using 
two forms of the same instrument. The more highly correlated the 
scores, the more reliable the instrument. Split-half reliability splits 
an instrument’s questions in half and compares scores from the 
two halves, presuming that they represent equivalent forms. 
Internal consistency reliability, most often reported in terms of 
Cronbach’s coefficient α, is equivalent to the average of all possible 
split-half consistency calculations for an instrument. 
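
As an illustration of internal consistency, Cronbach's coefficient α can be computed directly from item-level responses. The following is a minimal Python sketch, assuming the usual variance-based formula, α = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the response data are hypothetical and are not taken from this chapter.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: list of respondents, each a list of item scores of equal length."""
    k = len(scores[0])                        # number of items in the scale
    items = list(zip(*scores))                # transpose: one tuple per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical responses: 5 subjects x 3 items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because the three hypothetical items rise and fall together across subjects, the resulting coefficient is high; items that varied independently of one another would pull it toward zero.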

Reliability depends on the characteristics of both the instru¬ 
ment and the population. Because subject variability is present in 
the numerator, the same instrument is more reliable in a population 
that is heterogeneous with respect to the characteristics being mea¬ 
sured than in a more homogeneous population. This is a consider¬ 
ation with respect to the usefulness of an instrument in a population 
in which it has not previously been used. However, absent informa¬ 
tion from other comparable measures, the homogeneity of the 
population and the reliability of the instrument in that population 
are a matter of conjecture until some data are collected. 
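
The dependence of reliability on the population can be made concrete with the defining ratio, reliability = subject variability/(subject variability + measurement error). The sketch below is illustrative only: the variance values are invented, and the same measurement error is assumed in both populations.

```python
def reliability(subject_variance, error_variance):
    """Reliability = subject variability / (subject variability + measurement error)."""
    return subject_variance / (subject_variance + error_variance)

# Hypothetical: one instrument (error variance 4.0) used in two populations.
heterogeneous = reliability(subject_variance=16.0, error_variance=4.0)
homogeneous = reliability(subject_variance=4.0, error_variance=4.0)
print(heterogeneous, homogeneous)
```

With identical measurement error, the instrument appears considerably less reliable in the homogeneous population simply because there is less between-subject variability to detect.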

Reliability coefficients are correlation coefficients, with a range 
of 0-1. The scale’s reliability determines the width of the confi¬ 
dence interval around a score, and the reliability needed depends 
on the uncertainty that can be tolerated. In 1978, Nunnally suggested 
a reliability standard of 0.7 for data to be used in group 
comparisons and 0.9 for data to be used as the basis of evaluating 
individuals [20]. These standards are often cited, but were arbitrary 
when proposed and should not be considered absolute. Some of 
the statistics in which reliability is reported give higher values than 
others [1]. Finally, by way of comparison, brachial systolic blood 
pressure, the basis of so many individual treatment decisions in 
clinical practice, has been reported to have a reliability coefficient 
of 0.74 [21]. 
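
To see how reliability drives the width of the confidence interval around an individual score, one common classical test theory approach uses the standard error of measurement, SEM = SD × sqrt(1 − reliability). This formula is not stated in the chapter and is assumed here; the score, standard deviation, and function name are hypothetical.

```python
import math

def score_confidence_interval(score, sd, reliability, z=1.96):
    """Approximate 95 % interval around an observed score (assumes SEM formula)."""
    sem = sd * math.sqrt(1.0 - reliability)   # standard error of measurement
    return score - z * sem, score + z * sem

# Compare the two reliability standards discussed above for a score of 50 (SD 10).
for r in (0.7, 0.9):
    low, high = score_confidence_interval(score=50.0, sd=10.0, reliability=r)
    print(r, round(low, 1), round(high, 1))
```

Under these assumptions the interval at a reliability of 0.7 is substantially wider than at 0.9, which is one way to understand why a stricter standard is suggested for decisions about individuals.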


10 Classical Test Theory and Item Response Theory 

Since the middle of the twentieth century, most analyses of tests 
and survey data followed the principles of classical test theory, 
which emphasize the characteristics of scales and partitions 
observed scores into true score and error. Classical test theory 
describes the characteristics of scales in particular populations; 
application to new populations requires the empiric measurement 
of reliability and the establishment of new norms. Further, classical 
test theory does not distinguish the characteristics of the instru¬ 
ment from the characteristics of the individual whom it measures. 
A second limitation of classical test theory is that it offers a single 
reliability estimate for a scale, even though precision is known to 
vary with the level of a trait. Finally, classical test theory requires 
long instruments to achieve both precision and breadth [22]. 

Item response theory, or latent trait theory, is a computationally 
more intensive approach, which emphasizes the item rather than the 
scale and allows distinction and definition of the difficulty of a survey 
or test item and the level of the trait being measured in the individ¬ 
ual. Rasch models are closely related and generally considered a sub¬ 
set of item response theory. An important attribute of item response 
theory is that it allows the ranking of items in difficulty and the 
development of an item bank, facilitating computerized adaptive 
testing and making it a key aspect of the PROMIS strategy, as it enables 
the development of more tailored measurement [23, 24]. This tech¬ 
nique is widely used in educational testing and is becoming increas¬ 
ingly important in the assessment of patient-reported outcomes in 
clinical settings [25]. 
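
The separation of item difficulty from the individual's trait level can be sketched with the Rasch (one-parameter logistic) model, in which the probability of endorsing an item depends only on the difference between the two. The item bank, difficulty values, and the simple "closest difficulty" selection rule below are hypothetical simplifications, not a description of any particular adaptive testing system.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of endorsing an item under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical item bank: item -> difficulty on the latent trait scale.
item_bank = {"walk one block": -1.5, "climb stairs": 0.0, "run a mile": 2.0}

def next_item(theta_estimate, bank, already_asked=()):
    """Pick the unasked item whose difficulty is closest to the current estimate."""
    candidates = {k: b for k, b in bank.items() if k not in already_asked}
    return min(candidates, key=lambda k: abs(candidates[k] - theta_estimate))

ranked = sorted(item_bank, key=item_bank.get)  # items ordered easiest to hardest
print(ranked)
print(next_item(0.3, item_bank))
print(round(rasch_probability(0.3, item_bank["climb stairs"]), 3))
```

Selecting the item nearest the current trait estimate is the basic idea behind computerized adaptive testing: each respondent sees only the items that are most informative at his or her level.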


11 Comparing Groups and Assessing Change Over Time 

A research plan including study of patient-reported outcomes should 
include definition of what magnitude of difference is considered 
clinically important and whether the difference is between groups in 
a cross-sectional study or in the same group over time. There are at 
least two approaches. One approach compares the magnitude of 
observed difference to differences between other defined groups. A 
conceptually more rigorous, if intuitively less appealing, approach is 
to calculate an effect size; one calculation indexes the difference by 
the standard deviation of scores, where a value of 0.2 is considered 
small, 0.5 is moderate, and 0.8 is large [26-28]. 
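
A minimal sketch of an effect size calculation follows. It assumes the standardized mean difference with a pooled standard deviation (one of several possible choices of denominator), and it treats the 0.2, 0.5, and 0.8 values as lower bounds of the small, moderate, and large bands. The scores and the "negligible" label are hypothetical.

```python
import math
from statistics import mean, stdev

def effect_size(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def label(d):
    """Map an effect size to the conventional descriptive bands."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "moderate"
    if d >= 0.2:
        return "small"
    return "negligible"  # label assumed here; not defined in the text

# Hypothetical quality-of-life scores in two groups.
treated = [52, 55, 60, 58, 50]
control = [50, 49, 53, 47, 51]
d = effect_size(treated, control)
print(round(d, 2), label(d))
```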

A second and even more important issue in the longitudinal 
study of patient-reported outcomes is to minimize missing data. 
Missing items can make an entire scale unusable, and missing data can 
introduce important bias. Response to a questionnaire often 
requires more involvement by the study participant than giving a 
blood specimen or undergoing physical or radiologic examination, 
and it involves interaction between the participant and the study 
personnel. It is important that everyone involved understand that 
a response omitted because the participant does not feel like 
responding, for whatever reason, impairs the interpretability of the 
entire study. Almost inevitably, however, some patient-reported 
outcome data will be missing because of patient death or censoring 
for other reasons. Limiting analysis to those cases on which all 
observations are available risks important bias. Missing data cannot 
be ignored if toxicity, worsening of a disease, or the effect of a 
treatment may be associated with the absence of the data. Such 
circumstances seem to include many of the very situations in which 
one would want to measure patient-reported outcomes. A variety 
of sophisticated statistical methods are available to deal with this 
problem; in the absence of any clearly correct approach, it may be 
helpful to perform several analyses to determine whether the conclusions 
are sensitive to the precise technique used [29, 30]. 

Chapter 12 


Randomized Controlled Trials 4: Biomarkers 
and Surrogate Outcomes 

Claudio Rigatto and Brendan J. Barrett 

Abstract 

Biomarkers are defined as anatomic, physiologic, biochemical, molecular, or genetic parameters associated 
with the presence, absence, or severity of a disease process. As such, biomarkers may be useful as prognostic 
and diagnostic tests. Establishing the utility of a given biomarker as a prognostic or diagnostic test requires 
the conduct of carefully designed cohort studies in which the biomarker and the outcome of interest are 
measured independently. The design and analysis of such studies is discussed. Surrogate outcomes in clinical 
trials consist of events or biomarkers intended to reflect important clinical outcomes. Surrogate outcomes 
may offer advantages in providing statistically robust estimates of treatment effects with smaller sample sizes. 
However, to be useful, surrogate outcomes have to be validated to ensure that the effect of therapy on them 
truly reflects the effect of therapy on the important clinical outcomes of interest. 

Key words Biomarkers, Surrogates, Cohort studies, Clinical trials, Statistical methods 


1 Introduction 


Biomarkers have become very important to medical research and 
practice in recent years. The search term biomarker is associated with 
662,548 hits in PubMed as of April 30, 2014, with 
30 % of them being within the past 5 years. Biomarkers are often used 
to address issues of pathogenesis and as important indicators of prog¬ 
nosis. Furthermore, a role also exists for biomarkers in making a diag¬ 
nosis and in assessing the efficacy of therapies or interventions. In the 
latter role, biomarkers are one of a set of possible surrogate outcomes 
that can be useful in rendering clinical trials feasible or efficient. 


2 Definition and Uses of Biomarkers 

2.1 Definition 

Biomarkers can be defined as anatomic, physiologic, biochemical, 
molecular, or genetic parameters associated with the presence, 
absence, or severity of a disease process. Depending on their 
precise nature, biomarkers are detectable and quantifiable by a 
variety of methods including physical examination, laboratory 
assays, and radiological techniques. Often, biomarkers are parameters 
that are known or hypothesized to be causally involved in 
mechanisms of disease progression. For example, blood pressure 
and LDL cholesterol are biomarkers of, and causally linked to, 
development of atherosclerosis and cardiovascular disease. Evidence 
of a known or suspected causal link is not strictly necessary, and 
some biomarkers (e.g., neutrophil gelatinase associated lipocalin 
(NGAL) in acute renal failure [1], antinuclear antibody (ANA) in 
lupus, anti neutrophil cytoplasmic antibody (ANCA) in vasculitis) 
do not have clearly understood pathophysiologic roles. The essential 
properties of a biomarker are that it be measurable in some 
way, and that it be associated with the disease of interest. 

It is evident from the definition that some clinical signs, such as 
those observed in the course of a clinical exam (e.g., adenopathy, 
a rash, crackles on chest auscultation, number of swollen joints), 
qualify as biomarkers, as do physiologic variables such as blood 
pressure, even though we are not used to thinking of them as such. 
Similarly, radiological variables (e.g., tumor size on CT, luminal 
narrowing on coronary angiography) can also function as biomarkers. 
Parameters more commonly thought of as biomarkers include 
proteins such as serum C-reactive protein or troponin T in cardiovascular 
disease, immunoproteins such as antinuclear antibody in 
lupus, or genetic biomarkers such as specific gene polymorphisms. 

Patrick S. Parfrey and Brendan J. Barrett (eds.), Clinical Epidemiology: Practice and Methods, Methods in Molecular Biology, 
vol. 1281, DOI 10.1007/978-1-4939-2428-8_12, © Springer Science+Business Media New York 2015 

2.2 Conceptual Relationship Between Biomarkers, Risk Factors, and Surrogate Outcomes 

These terms overlap significantly in concept and meaning. The 
term risk factor was coined by epidemiologists over 50 years ago to 
denote a parameter whose presence or level was associated with a 
statistically higher probability over time of observing a specific dis¬ 
ease in a population. Typically a risk factor was a clinically evident 
and measurable characteristic, such as age and gender, or behavior, 
such as smoking or alcohol consumption, but was later extended to 
include many easily measurable physiological and laboratory 
parameters (e.g., blood pressure, cholesterol). While the latter 
quantities would also meet the definition of biomarkers, static, 
constitutive or irreversible risk factors such as age, gender or race 
would not. Behaviors can never be considered biomarkers, though 
they are perfectly legitimate risk factors. 

The term biomarker originated in the context of drug discovery 
for cancer, infections, and cardiovascular disease. Many of these 
diseases evolve over long periods of time and have imperfect animal 
analogues, making the process of drug discovery lengthy and 
difficult. Researchers needed a parameter or metric which might 
indicate quickly whether an agent possessed promising biological 
activity directed against disease in vivo in humans. These parameters 
originally fell into two categories: surrogates of disease progression 
(e.g., tumor size on X-ray, joint erosions, coronary artery 
narrowing), or measurement of physiologic, biochemical, or 
molecular parameters known to be involved in one or more mecha¬ 
nisms of disease development (e.g., cholesterol level, blood pres¬ 
sure level). Implicit in the definition of biomarker, therefore, is 
the notion of measuring disease activity or progression. Because 
biomarkers in some way measure mechanisms or intermediate 
stages in disease evolution, it is not surprising that they are often 
associated with clinical manifestations of disease, and thus can be 
considered risk factors in many cases [2]. 

Surrogates are parameters or measurements that take the place 
of or stand in for other measurements. In clinical research, one is 
frequently interested in hard outcomes such as death or the devel¬ 
opment of some specific disease state (e.g., development of end 
stage renal disease). Many diseases evolve slowly over time, making 
it time consuming and costly to study these end points directly. 
Substitution of a biomarker (e.g., microalbuminuria in diabetic 
nephropathy) which is measurable sooner can significantly shorten 
the length of a study and significantly reduce costs. A biomarker 
used in this way is termed a surrogate outcome. To be useful, the 
surrogate must be a biomarker possessing very tight association 
with the development of the disease state in question. Thus, not all 
biomarkers can be surrogates. Many pitfalls exist, limiting the use or 
interpretation of a surrogate outcome. These issues are discussed 
separately below. 

Because biomarkers are by definition associated with disease pro¬ 
cesses or development, they are potentially useful as markers of 
prognosis. To be useful prognostically, a biomarker must possess 
several properties (Table 1). The magnitude of the association must 
be high, so that the separation in prognosis between biomarker 
categories is high, and the effect must be independent of other 
prognostic factors. In addition, the biomarker must improve prediction 
of outcome beyond clinical variables alone. In statistical 
terms, this means the biomarker must improve the discrimination 
(e.g., c-statistic, integrated discrimination improvement (IDI)) and 
reclassification metrics (net reclassification improvement, vide infra) 
when added to a multivariate model. Lastly, the prognostic usefulness 
of a biomarker must be shown to be generalizable, which means 
validating the association in multiple studies and settings. In reality, 
very few current biomarkers possess these characteristics, which is 
one reason why biomarkers sometimes add disappointingly little to 
prognostic or diagnostic certainty [3], despite the common 
assumption that they are better predictors of disease outcomes 
than simple clinical data. 

Table 1 
Characteristics of an ideal prognostic biomarker 

Tight association with outcome (e.g., hazard ratio or odds ratio) 
Statistically independent 
Unconfounded by other prognostic factors 
Must improve discrimination and reclassification metrics 
Generalizable (e.g., validated in numerous studies and settings) 

2.4 Use in Diagnosis 

The discovery of biomarkers capable of replacing the biopsy or 
other invasive technologies in the diagnosis of many medical ailments 
is the “holy grail” of biomarker research. The requirements for a 
useful diagnostic biomarker are if anything more stringent than for 
a prognostic biomarker. In order to replace a “gold standard” test, 
such as a biopsy for the diagnosis of cancer, the ideal biomarker 
must achieve near perfect sensitivity and specificity, with a receiver 
operating characteristic (ROC) curve that is nearly a perfect square. 
No biomarker currently fulfills the requirements of an ideal diag¬ 
nostic test. The diagnostic utility of a particular biomarker, there¬ 
fore, depends on the strength of the association with disease status 
and whether the test performance characteristics are better than 
that of other existing diagnostic markers. The latter are typically 
measured by parameters calculated from a 2 × 2 contingency table, 
such as sensitivity (Sens), specificity (Spec), and positive and negative 
predictive values (PPV, NPV). In the case of biomarkers 
with multiple informative levels, an N × 2 table can be created and 
positive and negative likelihood ratios (LR) calculated for each of 
the N levels [4]. 

A direct mathematical relationship exists between the odds ratio 
(OR) for disease (a measure of strength of association) and test 
performance characteristics: 

OR = [Sens/(1 − Sens)] × [Spec/(1 − Spec)] = [PPV/(1 − PPV)] × [NPV/(1 − NPV)] (1) 

It is not difficult to see that very high degrees of association 
(i.e., very high OR) are necessary if a biomarker-based test is to 
yield acceptable performance characteristics. Suppose we desire a 
test with 95 % sensitivity and specificity. Substituting into Eq. 1, we 
see that the odds ratio for disease with a positive test (i.e., a test 
above the threshold) would need to be (0.95/0.05) × (0.95/0.05) = 
361! Relaxing our requirements to 90 % sensitivity and specificity, 
we would still require an observed odds ratio of 
(0.9/0.1) × (0.9/0.1) = 81. Suppose instead we would like to use 
our biomarker test as a screening test for a rare disease, and so 
require high sensitivity (say, 95 %) but can accept mediocre specificity 
(say, 50 %). The odds ratio for a positive biomarker test 
would have to be (0.95/0.05) × (0.5/0.5) = 19. Alternatively, 
suppose we wish the biomarker in question to function as a confirmatory 
test, and so require a high specificity (99 %) but can accept 
a low sensitivity (20 %). The OR associated with a biomarker 
having these performance characteristics would be (0.2/0.8) × 
(0.99/0.01) ≈ 25. It is evident from these and other examples 
that a biomarker is unlikely to yield a useful diagnostic test unless 
it possesses an extremely high association with disease (e.g., 
OR ≫ 10). 
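
The relationship in Eq. 1 can be checked numerically. The sketch below recomputes the odds ratio for the four sensitivity and specificity scenarios discussed above, and then verifies on an invented 2 × 2 table that the sensitivity/specificity form and the PPV/NPV form both equal the cross-product ratio. The function names and the table counts are hypothetical.

```python
def odds(p):
    """Convert a proportion to odds."""
    return p / (1.0 - p)

def odds_ratio_from_accuracy(sens, spec):
    """Eq. 1: OR = [Sens/(1 - Sens)] x [Spec/(1 - Spec)]."""
    return odds(sens) * odds(spec)

def odds_ratio_from_table(tp, fp, fn, tn):
    """Cross-product odds ratio from the cells of a 2 x 2 table."""
    return (tp * tn) / (fp * fn)

# Scenarios from the text: (sensitivity, specificity).
for sens, spec in [(0.95, 0.95), (0.90, 0.90), (0.95, 0.50), (0.20, 0.99)]:
    print(sens, spec, round(odds_ratio_from_accuracy(sens, spec), 2))

# Hypothetical 2 x 2 table: true/false positives and negatives.
tp, fp, fn, tn = 90, 30, 10, 170
sens, spec = tp / (tp + fn), tn / (tn + fp)
ppv, npv = tp / (tp + fp), tn / (tn + fn)
print(odds_ratio_from_table(tp, fp, fn, tn),
      round(odds(sens) * odds(spec), 6),
      round(odds(ppv) * odds(npv), 6))
```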

2.5 Use in Studies of Intervention 

Biomarkers may be used in interventional studies as either surrogates 
(discussed below), or as evidence for an effect on intermediary 
mechanisms of disease. In the latter case, they can lend an additional 
mechanistic, causal dimension to a clinical study. For example, 
consider a randomized clinical trial of cholesterol lowering with 
placebo vs. low and high dose HMG-CoA reductase inhibitor 
(i.e., “Statin”). Measurement of the clinical outcome alone can 
show only whether treatment has clinical efficacy. Simultaneous 
measurement of serum cholesterol in all groups will allow confirma¬ 
tion of whether the effect is associated with cholesterol lowering, as 
would be expected based on the known mechanisms of action. 
Since HMG-CoA reductase inhibitors may have pleiotropic and 
anti-inflammatory effects, the investigators might wish to address the 
hypothesis that statin treatment reduces inflammation and that this 
reduction influences outcome. This could be accomplished by mea¬ 
suring inflammatory markers such as CRP or IL-6 levels during 
the trial. Thus, integration of biomarker measurements in clinical 
studies can confirm known mechanisms of action and explore novel 
and hypothetical mechanisms of disease. 


3 Validation of Biomarkers 


As knowledge of disease mechanisms deepens, new compounds, 
proteins, and genes are discovered which might prove useful in 
diagnosing diseases or in prognosticating outcomes. Thus, a com¬ 
mon question faced in biomarker research is how to determine the 
usefulness of a candidate biomarker. Appropriate design of clinical 
studies, adhering to fundamental epidemiological principles, is a key 
component of this process. Unless the investigator is experienced in 
clinical studies of diagnosis and prognosis, collaboration with an 
epidemiologist and a statistician in the planning stages is highly rec¬ 
ommended, because no amount of post hoc statistical manipulation 
can overcome fatal flaws in study design. Finally, before being widely 
accepted in clinical use, the observations of one study should be 
confirmed in another independent set of patients. 


3.1 Designing 
a Study to Assess 
the Prognostic Value 
of a Biomarker 


Suppose we wish to know if a biomarker x is useful in predicting 
how long it will take patients at a defined stage of a chronic illness 
to arrive at a defined disease end point. For example, x could be 
proteinuria, the disease in question Stage 2 chronic kidney disease, 
and the defined disease outcome initiation of dialysis. The appropriate 
study design to address this question is a cohort study. 

The fundamental steps in the execution of a cohort study are 
as follows: 

1. Cohort Assembly : The investigator must recruit a cohort of 
patients having the condition in question. Alternatively, a his¬ 
torical cohort may be used, provided (1) adequate stored sam¬ 
ples are available for biomarker measurement and (2) the design 
requirements of a good cohort study are met (vide infra). 

2. Biomarker Assessment : Levels of the biomarker in question 
must be assessed at baseline in all patients. 

3. Assessment of other potential prognostic or confounding vari¬ 
ables : Levels of other variables known to be predictive of the 
outcome must also be assessed at baseline. These measure¬ 
ments will permit multivariate modeling to assess indepen¬ 
dence and freedom from confounding in the analysis phase, as 
well as assessment of the incremental predictive utility 
(i.e., improvement in net reclassification) of testing for the 
biomarker. 

4. Unbiased and unambiguous assessment of outcome: Outcomes 
must be defined unambiguously and measured in all patients. 
It is important that the surveillance for outcomes be identical 
for all patients in the cohort, to avoid the problem of differen¬ 
tial surveillance bias. Completeness of follow-up in all patients 
is also very important, since patients who drop out of a study 
may be systematically different from those who remain and so may 
bias the results of the study. In cases where the outcome has a 
subjective component (e.g., extent and severity of joint involvement 
in rheumatoid arthritis), it is important that the assessor 
be unaware of (blind or masked to) the level of the biomarker in order 
to avoid the possibility of bias. It is important that the out¬ 
come be defined completely without reference to and indepen¬ 
dently of the biomarker being examined. This can be a problem 
when the biomarker in question is also a criterion for outcome. 
This scenario is relatively common in the field of rheumatology, 
where biomarkers of disease activity (e.g., ANA, complement 
levels, white blood cell count) may also be a component of the 
outcome definition (e.g., a disease flare in Lupus). 

3.2 Analytical 
Considerations 
in a Prognostic 
Biomarker Study 


A thorough description of the approach to the analysis and multivariate 
modeling of cohort studies is the subject of entire textbooks 
[5], and well beyond the scope of this chapter. The single most 
important principle to be observed is that the plan of analysis 
should be discussed with a statistician and thoroughly outlined 
in the planning stages of the study, and not after the study is finished! 
Mathematically disinclined readers can stop here and skip to the 
section on diagnostic studies. Nevertheless, since a general 
understanding of the approach to analysis can be helpful in designing 
a study, we discuss a few broad principles of analysis below. 

The objective of the analysis is fourfold: (1) to estimate the 
strength and statistical significance of the association between biomarker 
levels and outcome; (2) to establish statistical independence 
of the biomarker in a multivariate model; (3) to estimate the degree 
of confounding with other parameters in the model; and (4) to establish 
the degree to which the biomarker improves the predictive power 
of the prognostic model. 

Survival analysis provides the most natural and informative way 
to analyze cohort data and the effect of biomarker levels on out¬ 
come. Survival analysis (also known as time-to-event analysis) 
very naturally handles the problems of censored data and unequal 
follow-up times which often can occur in cohort studies. Logistic 
regression is also commonly used if there is minimal censoring and 
we are only interested in risk over a defined period of time. Poisson 
regression and other GLM techniques can also be used although 
these are less common. 

The most common techniques used in survival analysis are 
Kaplan-Meier (K-M) analysis (bivariate, i.e., one outcome and one 
predictor variable) and Cox’s proportional hazards regression (mul¬ 
tivariate, i.e., multiple predictor variables). For a more complete 
treatment of these techniques, their pitfalls and assumptions, the 
interested reader is directed to the cited references [6]. The first 
analytic objective is usually achieved using a bivariate K-M analysis. 
In this analysis, continuous biomarkers (e.g., serum CRP concen¬ 
tration) are typically stratified into four or more levels (e.g., quartiles), 
because this involves fewer assumptions about the mathematical 
nature of the relationship between biomarker levels and outcome. 
What the analyst is looking for, in addition to statistical significance, 
is a strong association between biomarker and outcome (e.g., a haz¬ 
ard ratio of at least 10 between highest and lowest quartile/decile). 
Moreover, a smooth, graded association between biomarker quartile 
and the hazard ratio for outcome is reassuring and supportive of a 
causal association. 
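
For readers unfamiliar with the Kaplan-Meier method, the following is a minimal pure-Python sketch of the product-limit estimate for a single group. In a biomarker study it would be applied separately within each biomarker stratum (e.g., each quartile) and the resulting curves compared. The follow-up times and event indicators are invented, and the function name is hypothetical; a real analysis would normally rely on an established statistical package.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the outcome was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    curve = []
    survival = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        leaving = sum(1 for tt in times if tt == t)   # events plus censored at t
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical stratum: months of follow-up and whether dialysis was started.
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored patients contribute to the number at risk up to the time they leave the study, which is how the method accommodates unequal follow-up.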

The second and third objectives are usually achieved by con¬ 
structing appropriate multivariable models, typically using Cox 
regression. In this stage, the biomarker is best treated as a continuous 
variable to maximize power. Log transformation may be required to 
tame skew. The other prognostic variables included in the models 
should possess face validity and be known to predict outcome. 
The statistical independence of the biomarker is usually established 
by creating a model which includes all variables associated with outcome 
in the bivariate analysis (p < 0.1), and observing whether 
removal of the biomarker variable from the model results in a significant 
change in the -2 log-likelihood parameter of the model. 
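
The comparison of −2 log-likelihood values amounts to a likelihood ratio test. As a rough sketch, assuming the two models are nested and differ only by the single biomarker term (one degree of freedom), the change can be referred to a chi-square distribution. The log-likelihood values below are invented for illustration, and the function name is hypothetical.

```python
import math

def likelihood_ratio_test(loglik_without, loglik_with):
    """Likelihood ratio test for one added parameter (1 degree of freedom).

    loglik_without: log-likelihood of the model excluding the biomarker
    loglik_with:    log-likelihood of the model including the biomarker
    """
    # Change in -2 log-likelihood; cannot be negative for nested models.
    statistic = max(0.0, -2.0 * loglik_without - (-2.0 * loglik_with))
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical fitted log-likelihoods from two Cox models.
stat, p = likelihood_ratio_test(loglik_without=-412.6, loglik_with=-408.1)
print(round(stat, 2), round(p, 4))
```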

Several metrics of model performance are then calculated 
and compared between the full multivariate model including the 
biomarker and the alternative (or base) model which contains all 
the other important prognostic markers but excludes the biomarker 
in question. The operative principle here is that to be judged clini¬ 
cally useful, a new biomarker must improve prediction significantly 
over what is already possible using existing clinical data. 

The main metrics of predictive utility are discrimination and 
reclassification. Discrimination measures the ability of a model to 
accurately assign a higher probability to patients who have the event 
of interest, versus those who do not. The most commonly used 
metrics of discrimination are the concordance or c-statistic and the 
integrated discrimination improvement (IDI) index [7, 8]. 

The C statistic is defined as the proportion of times the model 
correctly discriminates between a randomly selected pair of case and 
control individuals, and is mathematically equivalent to the area 
under the receiver operating characteristic curve (AUROC) of the 
logistic or proportional hazards model. As with the AUROC, a 
c-statistic of 0.50 indicates that the model performs no better than 
chance; a c statistic of 0.70-0.80 indicates good discrimination; and 
a c statistic of greater than 0.80 is consistent with excellent discrimi¬ 
natory ability. Comparing the magnitude and statistical significance 
of the change in c-statistic between the base and the biomarker 
model is the traditional metric of biomarker usefulness. 
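
The pairwise definition of the c-statistic can be sketched directly. The code below counts, over all case-control pairs, how often the case receives the higher predicted probability (ties count as one half). It applies to a binary outcome without censoring; survival data require an extension that handles censored pairs. The predicted probabilities are invented, and the function name is hypothetical.

```python
def c_statistic(predicted, outcomes):
    """Proportion of case-control pairs in which the case has the higher prediction."""
    cases = [p for p, y in zip(predicted, outcomes) if y == 1]
    controls = [p for p, y in zip(predicted, outcomes) if y == 0]
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5      # ties count as half
    return concordant / (len(cases) * len(controls))

# Hypothetical predictions from a base model and a model adding the biomarker.
outcomes = [1, 1, 1, 0, 0, 0, 0]
base_pred = [0.6, 0.4, 0.3, 0.5, 0.2, 0.3, 0.1]
biomarker_pred = [0.7, 0.5, 0.3, 0.4, 0.2, 0.2, 0.1]
c_base = c_statistic(base_pred, outcomes)
c_new = c_statistic(biomarker_pred, outcomes)
print(round(c_base, 3), round(c_new, 3), round(c_new - c_base, 3))
```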

Integrated Discrimination Improvement: A limitation of using the 
c-statistic for estimating improvement in discrimination is that it 
exhibits asymptotic behavior: as the model c approaches 1, it 
becomes increasingly difficult to show a meaningful difference in 
C-statistics despite real improvements in model prediction. An alter¬ 
native and more sensitive measure of improvement is the integrated 
discrimination improvement index (IDI). The IDI measures the 
difference in discrimination slopes between the two models 
(i.e., mean predicted probability for those with the outcome vs. 
those without), and describes this on an absolute and relative scale. 
As such, the IDI can be an effective method for comparing dis¬ 
crimination between two models where differences in C statistic 
may be negligible. 
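
A minimal sketch of the IDI calculation follows, taking the discrimination slope of each model as the mean predicted probability among patients with the outcome minus the mean among those without. The relative IDI is computed here as the absolute IDI divided by the base model's slope, which is one common convention and is assumed rather than taken from the text. All predicted probabilities are invented.

```python
from statistics import mean

def discrimination_slope(predicted, outcomes):
    """Mean prediction in patients with the outcome minus mean in those without."""
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted, outcomes) if y == 0]
    return mean(events) - mean(non_events)

def idi(base_pred, new_pred, outcomes):
    """Return (absolute IDI, relative IDI) for the new model versus the base model."""
    base_slope = discrimination_slope(base_pred, outcomes)
    new_slope = discrimination_slope(new_pred, outcomes)
    absolute = new_slope - base_slope
    relative = absolute / base_slope   # assumed convention for the relative scale
    return absolute, relative

# Hypothetical predictions from a base model and a model adding the biomarker.
outcomes = [1, 1, 1, 0, 0, 0, 0]
base_pred = [0.6, 0.4, 0.3, 0.5, 0.2, 0.3, 0.1]
biomarker_pred = [0.7, 0.5, 0.3, 0.4, 0.2, 0.2, 0.1]
absolute_idi, relative_idi = idi(base_pred, biomarker_pred, outcomes)
print(round(absolute_idi, 3), round(relative_idi, 3))
```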

Reclassification : In clinical medicine, treatments and tests are often 
prescribed based on the predicted risk category of having an event. 
When a new prediction model is developed, it is important to con¬ 
sider whether it classifies patients into more appropriate risk strata 
than the old model. The new model may assign a given patient to 
the same risk category, a lower risk category, or a higher risk cate¬ 
gory relative to the old model. If the patient has an event, the new 
model can be considered successful if it assigns that patient to a 
higher risk stratum, but unsuccessful if it erroneously assigns a lower 
risk. Similarly, for patients who do not have an event, the new model 
is successful if it reclassifies to lower risk, and unsuccessful if it reclassifies 
to a higher risk stratum. The Net Reclassification Index (NRI) is 
essentially the sum of these successes and failures [9]. Positive 
values for NRI indicate correct net reclassification and negative values 
indicate incorrect reclassification. The NRI should be calculated 
using clinically accepted risk categories, wherever possible. For 
example, the NRI of a model for cardiovascular event prediction 
might use the Framingham risk categories as the basis for defining 
successful and unsuccessful reclassification. 

3.3 Sample Size Considerations for a Prognostic Biomarker Study 

Maximum likelihood based estimation procedures, used in generating
coefficients of logistic and Cox regression models, require a
minimum of ten outcomes per independent variable included in the
multivariate model to ensure model stability [10]. As an example, if
the outcome of interest is death, and it is anticipated that six
variables (the biomarker plus five adjustment variables) will need to
be included in the model, then a minimum of 10 × 6 = 60 deaths will
need to be observed in the cohort. If the mortality of the disease in
question is 20 % over 2 years, then 60/0.20 = 300 patients will need to
be observed for 2 years.

Although the above criterion must be satisfied, it does not
truly estimate the sample size required to measure a desired
improvement in discrimination. Two approaches can be employed:

1. Estimate (guess) the proportion of patients who will experience the
outcome in the high biomarker group vs. the low biomarker group
(e.g., biomarker level above the median vs. below the median).
This estimate can be based on prior studies, or defined according
to criteria of clinical significance. For example, if we are interested
in a biomarker with high prognostic power, we may only
be interested in detecting a relative risk of at least 8-10 for
mortality in patients with biomarker levels above the median
compared with those below the median. If the overall mortality
is expected to be 20 %, and the relative risk we are looking
for is 9, then we would require 18 % mortality in the high biomarker
group vs 2 % in the low biomarker group. The sample
size can then be calculated using standard formulae for comparing
two proportions. In the example cited, the minimum N
required will be 111 (assuming two-sided α = 0.05 and β = 0.2).
In this particular example, applying the first criterion (ten outcomes
per variable) resulted in a higher minimum N than applying the
second, but this is not always the case. Nevertheless, to satisfy both
criteria, one should select the higher of the two numbers.

2. Estimate the degree of improvement in the c-statistic. Although
more complex and beyond our scope here, methods exist to
calculate sample size for a given anticipated difference in the
c-statistic, and references have been provided for the interested
reader [7, 11].
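
The two criteria can be put side by side in a short Python sketch. The function names are hypothetical, and the two-proportion calculation uses the simple unpooled normal-approximation formula, which is an assumption of this sketch; variants with a pooled variance or a continuity correction give somewhat different totals, so the result need not reproduce the 111 quoted above.

```python
from math import ceil
from statistics import NormalDist


def n_for_events_per_variable(n_variables, event_rate, events_per_variable=10):
    """Cohort size needed to observe the minimum number of outcomes for model stability."""
    required_events = events_per_variable * n_variables
    # Small tolerance guards against floating-point noise pushing ceil() up by one.
    return ceil(required_events / event_rate - 1e-9)


def n_per_group_two_proportions(p_high, p_low, alpha=0.05, power=0.80):
    """Per-group size to detect p_high vs p_low (unpooled normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type 1 error
    z_beta = z.inv_cdf(power)           # power = 1 - beta
    variance = p_high * (1 - p_high) + p_low * (1 - p_low)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_high - p_low) ** 2)


# Criterion 1: ten deaths per variable, six variables, 20 % mortality.
n_stability = n_for_events_per_variable(n_variables=6, event_rate=0.20)

# Criterion 2: 18 % vs 2 % mortality above vs below the median biomarker level.
n_discrimination = 2 * n_per_group_two_proportions(p_high=0.18, p_low=0.02)

# Both criteria must hold, so the larger number governs.
n_required = max(n_stability, n_discrimination)
print(n_stability, n_discrimination, n_required)
```

Rerunning the sketch over a range of plausible relative risks is a quick way to see how sensitive the required N is to the assumed effect size.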

In addition, if the cohort is to be prospectively accrued over
time, an estimate of the attrition rate must be factored in and the
N increased accordingly. For example, if we anticipate a dropout
rate of 10 % over 2 years, and we need 300 patients for the analysis,
we would need to enroll about 333 patients (300/0.9) to ensure that
300 patients are followed to the end of the study.

Finally, sample size estimates are the product of multiple assumptions,
with the most problematic being the degree of discrimination
afforded by the biomarker. It is often useful to model sample sizes
required for a range of plausible assumptions about effect size.

3.4 Designing a Study to Measure the Diagnostic Usefulness of a Biomarker

A diagnostic study attempts to measure how well a diagnostic test 
predicts the presence or absence of a specific disease state. For 
example, a researcher might wish to know whether the presence of 
cleaved fragments of β2 microglobulin in the urine can correctly
distinguish acute rejection of a transplanted kidney from other 
causes of acute transplant kidney dysfunction (e.g., dehydration, 
acute tubular necrosis, calcineurin toxicity, viral nephropathy). In all 
cases, the result of the test (cleavage products present vs absent) is 
compared to the presence/absence of disease as assessed by a “gold 
standard” (see below). In order to address this question, the inves¬ 
tigator must: 

1. Assemble a diagnostic cohort: The investigator must enroll a
cohort of patients in whom the diagnosis is suspected. In the 
example cited, patients with transplanted kidneys and evidence 
of acute kidney dysfunction (e.g., a rise in serum creatinine, 
oliguria) would be the target population for the hypothetical 
study. 

2. Define the “gold standard” for diagnosis in all patients: The
term “gold standard” refers to a test or procedure the result of 
which is considered definitive evidence for the presence or 
absence of disease. Although conceptually straightforward, 
identifying an appropriate gold standard can be tricky. In the 
example cited above, a kidney biopsy might be considered a 
logical gold standard procedure for identifying rejection in a 
transplanted kidney. A positive gold standard might therefore 
be defined as presence of tubulitis, widely accepted as the path¬ 
ological hallmark of acute rejection. However, tubulitis is a 
continuum from very mild to severe, raising the question of 
how much tubulitis needs to be present. In addition, what other 
criteria might be needed in order to infer that the tubulitis is 
the main culprit clinically? Suppose there are also prominent 
viral inclusions or striped fibrosis (thought to represent calci¬ 
neurin toxicity), what then? Thus, defining the gold standard 
is not merely a question of choosing a test (or combination 
of tests!); it is a question of explicitly defining the criteria and 
procedures employed to judge whether a disease is present or 
absent. This can be quite a challenge, especially in situations 
where widely accepted criteria or procedures do not exist.
From a practical viewpoint, the gold standard chosen by the
investigator must at minimum be unambiguously defined, and
must employ the best techniques/procedures/criteria currently
available to define disease status.

3. Assess biomarker status and disease status independently in all
patients in the cohort: The key concepts here are “all” and
“independently”. Dropouts (patients who receive one or other
test but not both) cannot be included in the analysis and may
reduce the power of the study. If the dropouts are a non-random
sample (highly likely), their exclusion may bias the
data. If outcomes are assessed in only a non-random portion of
the cohort (e.g., if pulmonary angiography for gold standard
diagnosis of embolism is done only in patients with an elevated
d-dimer level), then the assumption of independence is violated,
since the assessment of the gold standard will be conditional
on the level of the biomarker in question. For the same
reasons, assessment of the gold standard should be done without
any knowledge of the level of the biomarker, and vice
versa, to prevent bias. In practice, this may be achieved by
blinding the assessor of the gold standard to any knowledge
of biomarker status. These two criteria of completeness and
unbiased independence of assessment of both biomarker and
disease status are almost impossible to satisfy in administrative
datasets, retrospective cohorts (unless stored biological samples
are available) and other “found” data, which is why these
data sources are in general unsuitable for assessment of diagnostic
test performance.

3.5 Analytical Considerations in a Diagnostic Biomarker Study

An in-depth discussion of the analytical tools used to characterize test
performance is beyond the scope of this chapter (see references).
We provide here a broad outline of the analytical approach, highlighting
critical aspects of the process.

1. Dichotomous biomarker. The simplest case is when the biomarker 
is dichotomous, i.e., either present or absent. In such instances, 
a single 2x2 contingency table is created. Patients are classified 
into four groups according to biomarker and disease status 
(Fig. 1). Test performance characteristics, with associated 95 % 
confidence limits, can then be calculated in the usual way. 
The most helpful statistics are the positive and negative predictive
values (PPV, NPV); these are distinct from sensitivity and specificity,
which are the true positive and true negative rates (TPR, TNR). To be
useful, the test must have a high PPV or NPV, or both.

2. Continuous biomarker. The basic approach involves converting
the continuous biomarker into a dichotomous variable. This is 
done by assigning different “thresholds” above which the 
biomarker is considered a positive test. The individual data are 
then reclassified as positive or negative for each threshold, and
can be compared to the disease state (present/absent) using
the usual 2 × 2 contingency table. A different contingency table
is thus generated for each threshold, from which the sensitivity
and specificity associated with each threshold of the biomarker
are calculated. A plot of 1-specificity vs. sensitivity for each
threshold is then created, forming the ROC curve. Typically,
ten or more points (and thus thresholds) reflecting the
observed range of the biomarker are necessary to adequately
plot the curve. Modern statistical programs automate this process
and can generate curves based on hundreds of thresholds.
The area under the ROC curve varies between 0.5 (no discrimination,
a straight line) and 1.0 (perfect discrimination).
An area between 0.8 and 0.9 is generally considered “excellent”
test discrimination. The areas calculated from separate ROC
curves generated from two different biomarkers simultaneously
and independently measured in the same population can
be formally compared, permitting conclusions about whether
one biomarker has statistically better diagnostic performance
than another. The optimum threshold for a continuous biomarker
is usually defined as the point maximizing both sensitivity
and specificity, and corresponds to the point on the ROC
curve closest to the point (0,1) on the ROC graph.

Fig. 1 The graph shows the relationship between true clinically important and
surrogate end points under treatment with a new intervention (I) and standard
control therapy (C). Under both I and C individually there is a perfect correlation between
the level of the surrogate and true end points as reflected by the straight lines.
However, because of the different slopes of the lines, unless the slopes and
intercepts were known, it would be possible to reach an incorrect conclusion about
the average effect of I and C on true end points (shown as tIav and tCav) from
knowledge of their average effects on the surrogate (shown as sIav and sCav).
(Axis label: True Endpoint.)
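
The threshold sweep described for a continuous biomarker can be sketched in a few lines of Python. This is a minimal illustration rather than a substitute for a statistical package: the function names and the eight-patient data set are hypothetical, a test is assumed to be positive when the biomarker is at or above the threshold, and the area is obtained with the trapezoidal rule.

```python
from math import hypot


def two_by_two(test_positive, diseased):
    """Counts (TP, FP, FN, TN) of the 2 x 2 table for paired boolean lists."""
    pairs = list(zip(test_positive, diseased))
    tp = sum(1 for t, d in pairs if t and d)
    fp = sum(1 for t, d in pairs if t and not d)
    fn = sum(1 for t, d in pairs if not t and d)
    tn = sum(1 for t, d in pairs if not t and not d)
    return tp, fp, fn, tn


def performance(tp, fp, fn, tn):
    """Test characteristics for one 2 x 2 table (assumes no empty row or column)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def roc_points(values, diseased):
    """(1 - specificity, sensitivity, threshold) for every observed threshold."""
    points = [(0.0, 0.0, float("inf"))]  # threshold above every observed value
    for threshold in sorted(set(values), reverse=True):
        tp, fp, fn, tn = two_by_two([v >= threshold for v in values], diseased)
        points.append((fp / (fp + tn), tp / (tp + fn), threshold))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]))


def closest_to_corner(points):
    """ROC point nearest to (0, 1), i.e. perfect sensitivity and specificity."""
    return min(points[1:], key=lambda p: hypot(p[0], 1 - p[1]))


# Hypothetical biomarker levels and gold-standard disease status.
values = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
diseased = [True, True, False, True, True, False, False, False]

points = roc_points(values, diseased)
auc = auc_trapezoid(points)
best = closest_to_corner(points)
table = two_by_two([v >= best[2] for v in values], diseased)
print(f"AUC = {auc:.3f}, threshold = {best[2]}, {performance(*table)}")
```

With real data the predictive values in the final table depend on the prevalence of disease in the cohort, which is one reason the diagnostic cohort must resemble the population in which the test will be used.
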
3.6 Sample Size Considerations for a Diagnostic Study


Since the area under the ROC curve is a measure of the diagnostic 
accuracy of the biomarker, a logical approach is to use this statistic 
as the basis for sample size calculations. A paper by Hanley and 
McNeil describes an approach using the Wilcoxon statistic and, 
more importantly, contains useful nomograms and tables that per¬ 
mit direct calculation of sample size needed for (1) estimation of 
the area under the ROC within a desired range of precision and (2) 
detection of a statistically significant difference between the ROC 
areas of two biomarkers [11]. 
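
For readers without the nomograms to hand, the closed-form standard error of the ROC area that is usually attributed to Hanley and McNeil can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, equal numbers of diseased and non-diseased patients are assumed, and the anticipated area of 0.80 and the target precision are made-up inputs.

```python
from math import sqrt


def hanley_mcneil_se(auc, n_diseased, n_nondiseased):
    """Approximate standard error of an ROC area for given group sizes."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_diseased - 1) * (q1 - auc ** 2)
                + (n_nondiseased - 1) * (q2 - auc ** 2)) / (n_diseased * n_nondiseased)
    return sqrt(variance)


def n_per_group_for_half_width(auc, half_width, z=1.96):
    """Smallest equal group size whose 95 % confidence half-width meets the target."""
    n = 2
    while z * hanley_mcneil_se(auc, n, n) > half_width:
        n += 1
    return n


# Hypothetical planning inputs: anticipated area 0.80, desired precision +/- 0.05.
se_50 = hanley_mcneil_se(0.80, 50, 50)
n_needed = n_per_group_for_half_width(0.80, 0.05)
print(f"SE with 50 per group = {se_50:.3f}; n per group for +/- 0.05 = {n_needed}")
```

Because the standard error depends on the anticipated area itself, it is worth repeating the calculation for several plausible values of the area.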


3.7 Establishing Generalizability: Derivation and Validation Sets


A single analysis is never sufficient to establish the validity of a given 
biomarker. Estimates of association (hazard, odds and rate ratios) 
and test performance (sensitivities, specificities, etc.) can vary mark¬ 
edly from one study to the next, and can often diminish markedly 
from early studies to later studies. Early prognostic studies are often 
small and analyzed intensively, all of which increase the risk of model 
overfitting and thus observing spuriously inflated associations. Early 
diagnostic studies often enroll patients exhibiting a discontinuous, 
bimodal pattern of disease, with patients either having no disease or 
having severe disease. Biomarkers typically perform better in these 
bimodal cohorts than in more representative cohorts where a 
smoother spectrum of disease severity is evident, a phenomenon 
called “diagnostic spectrum effect (or bias)” [12]. 

Before being widely accepted in clinical use, the observations of 
one biomarker study should be objectively confirmed in multiple, 
independent, clinically relevant populations. At minimum, confir¬ 
mation in at least one other independent set of patients is required. 
Since the fundamental purpose of repeating a study is to show the 
generalizability of the findings of the first study to other similar pop¬ 
ulations and settings, this final “validation step” is ideally conducted 
as a completely separate study, in a separate place and at a different 
time. If the results of such studies are concordant, then there will be 
reasonably strong evidence of generalizability. 

Because conducting two completely independent studies is 
very costly and often unfeasible, an alternative is to conduct a single 
study in which patients are randomly grouped into derivation and 
validation sets. All the model building and statistical testing is con¬ 
ducted in the derivation set (hence the name); the parameters of 
the same models are then recalculated using data from the 
validation set, and compared. Because the main purpose of the 
validation set is to demonstrate that the model parameter estimates 
(e.g., hazard ratios, odds ratios, specificities, C statistics) are similar 
to those in the derivation set, and not to retest statistical signifi¬ 
cance, the validation set can frequently be smaller. Typically, 
the validation set is half the size of the derivation set, but
can be larger or smaller depending on need and data availability. 
The advantage of randomly selecting derivation and validation 
sets from the same cohort is that the study apparatus does not
have to be duplicated. The disadvantage is that random samples
from a single cohort are more likely to show congruent results than 
samples from two completely distinct cohorts, and thus this strat¬ 
egy is a less rigorous test of generalizability. 
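
A random split of this kind is simple to implement. The following Python sketch is one possible way to do it; the function name, the fixed seed, and the patient identifiers are hypothetical, and the validation fraction of one third simply reflects a validation set half the size of the derivation set.

```python
import random


def split_derivation_validation(patient_ids, validation_fraction=1 / 3, seed=0):
    """Randomly assign patients to derivation and validation sets."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    derivation = shuffled[n_validation:]
    return derivation, validation


# Hypothetical cohort of 300 patients identified by number.
derivation, validation = split_derivation_validation(range(1, 301))
print(len(derivation), len(validation))
```

All model building would then use only the derivation set, with the validation set held back until the model is final.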


4 Surrogate Outcomes 

Outcome measures or end points are important to all forms of 
clinical research. For studies of disease prognosis, they usually con¬ 
sist of some identifiable disease state (e.g., development of a cancer, 
onset of end stage kidney disease, death, or death from some par¬ 
ticular cause). Studies of prognosis might seek to describe time from 
some defined start point (at which point all those studied would be 
free of the outcome of interest) until either the development of the 
outcome of interest, or the end of an observation period. Such stud¬ 
ies might also assess how well factors measured at the start time, or 
during the early period of observation predict the outcome of inter¬ 
est. For purely descriptive studies, the frequency of the relevant clini¬ 
cally important outcome does not have to be high to make the study 
worthwhile. Study sample size, and by extension feasibility, is partly 
determined by frequency of outcome, but even if outcomes are rare,
a study of a few hundred cases may still suffice to establish their frequency with
reasonable precision. For example, a study of 400 persons where
death was observed to occur in 2 % over a 5-year period of observa¬ 
tion would be associated with a 95 % confidence interval ranging 
from about 1-4 % around the estimate. Such a degree of precision 
might be adequate in many instances. 
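
The quoted interval can be checked with a short calculation. The sketch below uses the Wilson score interval, which is a choice made here for illustration (the text does not say which method was used); the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(events, n, z=1.96):
    """Wilson score confidence interval for a proportion."""
    p = events / n
    denominator = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    return centre - half_width, centre + half_width


# 2 % of 400 persons corresponds to 8 deaths.
lower, upper = wilson_interval(events=8, n=400)
print(f"95 % CI: {100 * lower:.1f} % to {100 * upper:.1f} %")
```

The interval is asymmetric around 2 %, as expected for a proportion close to zero.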

In situations where the research question relates to the effect of 
an intervention on outcome, use of “hard” or clinically important 
end points tends to be most persuasive. Clinically meaningful out¬ 
comes are those that reflect how people feel, function or survive. 
Ultimate outcomes must reflect both the possibility of benefit and 
harm associated with the choice of intervention. Such outcomes 
include death rates, disease events and measures of quality of life. 
For example, in comparing the effect of bare metal stents to drug 
eluting stents in the therapy of coronary disease, the most impor¬ 
tant outcomes would include patient survival, rate of subsequent 
myocardial infarction, and possibly need for future revasculariza¬ 
tion. Effort and care are also required to determine when an event 
has occurred. One common strategy is to have blinded adjudication 
by an end point committee. Demonstrating an effect on such out¬ 
comes will tend to be persuasive to clinicians, patients and payers as 
the meaning of the effect is usually understandable and the events 
avoided with the more efficacious therapy are themselves meaning¬ 
ful. However, comparative trials often have to be very large to prove 
superiority of one therapy over another if the rate of events in 
controls is low and the minimum clinically important difference in 
outcomes between therapies is small. For example, a 5 % rate of
death over 4 years was associated with bare metal stents in a pooled
analysis of randomized trials comparing bare metal stents to sirolimus
eluting stents [13]. Now, suppose that a 1 % difference in death
rates over a 4 year period would be important to identify with
statistical significance (p < 0.05) in a comparative trial. Such a trial
would have to enroll over 9,000 subjects per group to have a 90 % chance
(power) to detect such a difference in death rates. Clearly, a trial of 
this magnitude would be both costly and difficult to run. 
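
The order of magnitude can be verified with the usual normal-approximation formula for comparing two proportions. The worked equation below is a sketch under stated assumptions that are not spelled out in the text: death rates of 5 % and 4 % in the two arms, a two-sided α of 0.05, 90 % power, and equal allocation.

```latex
n_{\text{per group}}
  = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
          \left[\,p_{1}(1-p_{1}) + p_{2}(1-p_{2})\,\right]}
         {(p_{1}-p_{2})^{2}}
  = \frac{(1.960 + 1.282)^{2}\,\left[(0.05)(0.95) + (0.04)(0.96)\right]}
         {(0.05 - 0.04)^{2}}
  \approx \frac{10.51 \times 0.0859}{0.0001}
  \approx 9.0 \times 10^{3}
```

Because the denominator is the square of the absolute difference, halving the difference to be detected roughly quadruples the required sample size.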

There are several options available to limit sample size when 
designing a trial. These include recruitment of subjects at higher 
risk for an outcome event, but doing so affects the generalizability 
of trial results. Another option is to use composite outcomes. 
Individual components of composites may be uncommon, but 
together the rate of composite events may be high enough to limit 
the sample size required. Components of composite outcomes 
include events that share a likelihood of benefiting from the inter¬ 
vention under study. For example, a trial might seek to determine 
the effect of a lipid-lowering drug on future myocardial infarction, 
revascularization, or cardiovascular death. However, the impact of 
therapy on individual components of the composite may vary, 
and not all components of the composite are likely to be of equal 
clinical importance. 

Another commonly employed means to limit trial sample size is 
to choose an outcome that is measurable in all study participants on 
a quantitative scale. For example, in studying a new antihyperten¬ 
sive, the initial studies are likely to assess impact on average blood 
pressure, rather than rates of stroke or kidney failure. In this example, 
the blood pressure is a surrogate outcome and one would have to 
rely on data from other sources to judge the likely impact on disease 
events of the degree of blood pressure lowering observed. 

A surrogate outcome is defined as a (bio)marker that is intended 
to serve as a substitute for a clinically meaningful end point and is 
expected to predict the effect of a therapeutic intervention. Some 
examples of surrogate outcomes (and the associated ultimate out¬ 
comes) include proteinuria (kidney failure, death, cardiovascular 
events), LDL cholesterol level (cardiovascular events and death), 
and left ventricular function (heart failure and death). The advan¬ 
tages associated with using surrogate outcomes as end points in 
trials of therapy include the smaller sample size as discussed above, 
as well as the possibility of being able to demonstrate an effect 
within a shorter time, thus also lowering the cost of the study. 
As such, early phase trials of an intervention often use surrogates as 
primary outcome measures. The results of the trials might not be 
considered definitive enough to drive a change in clinical practice 
(although they are sometimes marketed in an effort to do so), but 
the data may be persuasive enough to justify more expensive larger 
trials with clinically important outcomes. Measurement of surro¬ 
gates during trials of therapy in which clinically relevant end points 
make up the primary outcome may be helpful in understanding
how a therapy works. Consider a trial comparing a new drug,
already known to lower LDL cholesterol level and to have unique 
effects on inflammation or oxidation pathways, to a statin in terms 
of cardiovascular event reduction. In such a trial, analyses might 
seek to determine whether any differential effect of the interven¬ 
tions on inflammatory or oxidation markers was associated with any 
difference in clinical outcomes. If such an association was found, 
the new drug might next be tested in other disease states in which 
inflammation or oxidation were thought to play a role. 

Just because a marker is associated with a disease of interest does 
not necessarily imply that it will be a valid substitute for clinically 
relevant outcomes in trials of therapy for that condition. For example, 
low HDL levels are associated with progression of atherosclerotic 
vascular disease. However, in a trial of torcetrapib, HDL levels were 
increased but there was no effect on progression of coronary athero¬ 
sclerosis [14]. A valid surrogate needs to satisfy the following 
conditions: 

• Be predictive of clinically important outcomes 

• Predict corresponding changes in clinically important outcomes 
when itself changed by therapy 

• The way therapy affects the surrogate should at least partly 
explain how therapy affects the clinically relevant outcome 

• In the case of a surrogate for drug effects, the dose response 
should be similar for the surrogate and the clinical effects 

It should be noted that a measure may be a valid surrogate for 
some, but not all clinically important end points. In addition, a 
surrogate that is valid for the effects of one intervention may not 
be a valid surrogate for other interventions. For example, both 
statins and sevelamer lower serum LDL cholesterol. A lower serum 
LDL cholesterol level may be a valid surrogate for the impact of 
statins on vascular disease, as reduction in LDL levels in response 
to statins has been associated with reduced cardiovascular events 
in numerous trials, and a dose response relationship was also found 
in trials comparing doses [15]. However, it would not then be 
correct to assume that sevelamer would have the same impact on 
cardiovascular events as a statin if given in doses that had a compa¬ 
rable effect on LDL levels. It could well be that the benefit of 
statins is linked both to how they lower LDL cholesterol as well as 
to other parallel effects not shared by sevelamer. 

To validate a surrogate outcome requires that it be measured 
during a trial testing the effect of the therapy of interest on clinically 
relevant outcomes. Demonstrating a simple correlation between 
the effect of therapy on the surrogate and on the clinical outcome 
is not sufficient to declare the surrogate valid [16]. This is because, 
as shown by Baker (Fig. 1), the slope of the relationship between 
the effect of an experimental therapy E on surrogate and true 
outcome may differ from that of a control therapy C. Even when a
higher level of the surrogate is perfectly linearly associated with a
greater frequency of clinical events under either therapy, the differ¬ 
ence in slopes can lead to a lower level of the surrogate, but a higher 
frequency of adverse events with therapy E than therapy C. To 
avoid this error, Prentice contended that the distribution of true 
end points given the observed distribution of surrogate end points 
should not be conditional on treatment group [17]. Graphically 
this implies that the slope of the lines in Fig. 1 be the same. However, 
this is overly restrictive, as one only has to know the intercept and 
slope of the lines to make the correct inference. One can use data 
from a validation study to estimate the slopes and intercepts. One 
approach to validating a surrogate relies on hypothesis testing. If in 
validation trials, the Prentice criterion is met and the slopes and 
intercepts are similar, then rejection of the null hypothesis for the 
surrogate will imply rejection of the null hypothesis for the clinical 
end point. On the other hand, if rejection of the null for the sur¬ 
rogate does not imply rejection of the null for the clinical end point, 
the surrogate has not been validated. However, this approach is 
restrictive and does not identify all potentially valid surrogates. 
An alternative meta-analytic approach uses regression to predict 
separately the effect of intervention on surrogate and true end 
points [18]. An advantage of this approach lies in the ability to 
examine surrogates for harmful as well as beneficial effects of therapy. 
With this approach, pooled data from several prior intervention 
studies are analyzed to develop an estimate of the effect of therapy 
on surrogate and clinical outcomes. These estimates can be vali¬ 
dated in a further trial by comparing the predicted to the observed 
effect of therapy on both surrogate and clinical outcome. The sur¬ 
rogate may be considered valid if the prior estimate and the newly 
observed effects are sufficiently similar. What constitutes sufficient 
similarity requires medical judgment as well as consideration of 
statistics. It should be noted that this whole process is dependent 
on there being sufficient data from prior trials to develop adequately 
precise estimates of the effect of therapy on surrogate and clinical 
outcomes. The decision as to what constitutes adequate validation 
is also not hard and fast and how surrogates are validated should 
always be scrutinized before relying heavily on conclusions of future 
trials using the surrogate as primary end point. 
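
The point about differing slopes can be made concrete with a small numerical sketch in Python. All intercepts, slopes, and mean surrogate levels below are invented for illustration; the only feature taken from the text is that each therapy has its own perfectly linear relationship between the surrogate and the true end point.

```python
def true_event_rate(surrogate_level, intercept, slope):
    """True end point as a perfectly linear function of the surrogate level."""
    return intercept + slope * surrogate_level


# Hypothetical lines: same intercept, but a steeper slope under therapy E.
control = {"mean_surrogate": 0.5, "intercept": 0.05, "slope": 0.2}
experimental = {"mean_surrogate": 0.3, "intercept": 0.05, "slope": 0.6}

# For a linear relationship, the mean true outcome is the line at the mean surrogate.
rate_c = true_event_rate(control["mean_surrogate"],
                         control["intercept"], control["slope"])
rate_e = true_event_rate(experimental["mean_surrogate"],
                         experimental["intercept"], experimental["slope"])

surrogate_favours_e = experimental["mean_surrogate"] < control["mean_surrogate"]
outcome_favours_e = rate_e < rate_c

print(f"surrogate: E {experimental['mean_surrogate']} vs C {control['mean_surrogate']}")
print(f"event rate: E {rate_e:.2f} vs C {rate_c:.2f}")
print(f"surrogate favours E: {surrogate_favours_e}; outcome favours E: {outcome_favours_e}")
```

In this invented example therapy E lowers the surrogate yet produces more clinical events, which is exactly the error that knowledge of the slopes and intercepts is meant to prevent.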

A few caveats need emphasis in relation to surrogate outcomes. 
Validating a surrogate for one population, intervention or clinical 
outcome does not imply that the surrogate will be valid for another 
population, intervention or clinical outcome. A valid surrogate for 
benefits may not be a valid surrogate for harms associated with an 
intervention. With these caveats considered, one might question 
whether there is any real advantage to using surrogates. However, 
useful surrogates may become accepted if they show a similar 
relationship to clinical end points under multiple interventions and 
in different disease states.


Chapter 13


Randomized Controlled Trials 5: Determining the Sample 
Size and Power for Clinical Trials and Cohort Studies 

Tom Greene 

Abstract 

Performing well-powered randomized controlled trials is of fundamental importance in clinical research. 
The goal of sample size calculations is to assure that statistical power is acceptable while maintaining a small 
probability of a type I error. This chapter overviews the fundamentals of sample size calculation for stan¬ 
dard types of outcomes for two-group studies. It considers (1) the problems of determining the size of the 
treatment effect that the studies will be designed to detect, (2) the modifications to sample size calcula¬ 
tions to account for loss to follow-up and nonadherence, (3) the options when initial calculations indicate 
that the feasible sample size is insufficient to provide adequate power, and (4) the implication of using 
multiple primary endpoints. Sample size estimates for longitudinal cohort studies must take account of 
confounding by baseline factors. 

Key words Sample size estimation, Randomized clinical trials, Cohort studies, Type I error, 
Type II error, Power 


1 Introduction 


Inferences in clinical research face multiple sources of uncertainty, 
including bias from uncontrolled confounding, selection bias, 
errors in generalizing results from a specific study to clinical prac¬ 
tice, as well as errors resulting from random variation between the 
study sample and the population from which the sample was drawn 
[1]. In contrast to the first three of these sources of uncertainty, where
quantification of error is usually limited to sensitivity analyses or
gross error bounds, uncertainty associated with random variation
in the study sample can be quantified with probability theory.
Using probability theory, it is possible to derive mathematical
relationships between the sample size of the study and probabilities
that random sampling error would lead to false positive or false 
negative conclusions. 

Notwithstanding its mathematical precision, the relationship 
between sample size and the risk of false conclusions depends on
characteristics of the study population and of the future conduct of
the trial which may not be fully understood when the study is 
designed [2, 3]. The selection of the sample size also depends on 
the magnitude of the effect the study should be designed to detect, 
a complex problem which creates an additional complication for 
investigators [4, 5]. Finally, because logistical constraints often 
limit the feasible sample size, control of the risk of false positive 
and false negative errors often entails consideration of alternative 
outcomes or alternative study designs which have a greater proba¬ 
bility of detecting an effect of the treatment with a smaller sample 
size. Thus the exercise of sample size calculation encompasses not 
only the derivation of probabilistic relationships between sample 
size and the risk of false conclusions, but more fundamentally the 
elaboration of the assumptions going into the calculations, the 
determination of the appropriate effect size, and the selection of 
the primary outcome and study design. This chapter examines each 
of these elements of sample size calculation. We focus initially on 
comparisons between treatment and control groups in randomized 
clinical trials and subsequently extend the discussion to longitudi¬ 
nal cohort studies. 

The chapter is organized as follows. Subheading 2 reviews the 
concept of statistical power within the classical hypothesis testing 
framework and examines the implications of statistical power for 
the positive and negative predictive value associated with the find¬ 
ings of a study. Using this framework, we review the fundamental 
importance of conducting well-powered studies in clinical research. 
Subheading 3 overviews the fundamentals of sample size calcula¬ 
tion for standard types of outcomes for two-group studies. 
Subheading 4 considers the problem of determining the size of the 
treatment effect that the study will be designed to detect. 
Subheading 5 reviews modifications to sample size calculations to 
account for loss to follow-up and nonadherence. Subheading 6 
considers steps which may be taken when the initial sample size 
calculations indicate that the feasible sample size is insufficient to 
provide adequate power. This section includes an examination of 
the use of composite endpoints. Subheading 7 examines the impli¬ 
cations of multiple primary endpoints for sample size calculation. 
Finally, Subheading 8 addresses modifications to sample size calcu¬ 
lations for randomized trials which are needed for nonrandomized 
comparisons in longitudinal cohort studies. 


2 The Importance of Statistical Power 

2.1 Definition of Statistical Power

Under classical hypothesis testing, the statistical inferences of a
randomized clinical trial comparing a treatment to a control group
are interpreted as determining whether the trial results provide sufficient
evidence to conclude, with low risk of error, that the null
hypothesis of no difference in outcome between the treatment and
control groups should be rejected in favor of a research hypothesis
which corresponds to a clinically or biologically important difference
[6]. In this framework, there are two classes of error: A type 1
error occurs if the investigators erroneously reject the null hypothesis
when the null hypothesis is true, and a type 2 error occurs if the
investigators fail to reject the null hypothesis when the research
hypothesis is true (see Fig. 1). The statistical power of the trial is
defined by subtracting the probability of making a type 2 error
from 1 and represents the probability of rejecting the null hypothesis
in favor of the research hypothesis when the research hypothesis
is true. The goal of sample size calculations is to determine the
sample size necessary to assure that statistical power exceeds an
acceptable minimum threshold, typically 0.80-0.95, while assuring
that the probability of type 1 error is sufficiently small, typically
between 0.01 and 0.05. The probabilities of type 1 and type 2
errors are often referred to by the symbols α and β, respectively.

Fig. 1 Classical hypothesis testing. Displayed are the two types of erroneous
conclusions under classical hypothesis testing for a comparison of a treatment to
a control: type I error, which results when the null hypothesis of no effect of the
treatment is erroneously rejected, and type II error, which results when a study
fails to detect a true effect of the treatment

From a mathematical perspective, statistical power is a function
which assigns the probability of rejecting the null hypothesis for all
possible treatment effects, including the particular treatment
effects corresponding to the null and research hypotheses as special
cases (see Fig. 2). In most cases, the power function equals α when
the null hypothesis is true and increases continuously as a smooth
curve for nonzero effects until the size of the treatment effect
reaches the effect size designated by the research hypothesis, and
eventually increases to approximately 1 for arbitrarily large
treatment effects. A corollary is that regardless of the sample size, one
can define the statistical power to be anywhere between the α-level
and 1, depending on what effect size is stipulated for the research
hypothesis. This underscores the importance of the appropriate
selection of the effect size in sample size calculations.

Fig. 2 Power curve for two-sided test (horizontal axis: mean difference in outcome between treatment and control). Displayed is the power curve for a two-sided test comparing the mean
values of a continuous outcome between treatment and control groups. The curve indicates the probability that
the null hypothesis is rejected for each possible difference in the outcome means between the two groups. The
power is equal to the α-level (0.05 here) when the null hypothesis of no treatment difference is true, and
increases to 1 as the difference in mean values increases. In this example, the power is 0.696 for a “moderate”
difference in the means equal to one-half of 1 standard deviation in the outcome variable
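
The shape of the power function can be sketched numerically. The following Python example is a minimal illustration rather than the calculation behind Fig. 2: it assumes a normal approximation with known standard deviation, equal group sizes, and a hypothetical sample size of 50 per group. The function name `power_two_sided` is introduced here only for illustration.

```python
from statistics import NormalDist

def power_two_sided(std_diff, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    std_diff is the true mean difference divided by the standard deviation.
    Normal approximation with known variance (an assumption of this sketch).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the true difference.
    shift = std_diff * (n_per_group / 2) ** 0.5
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical design: 50 patients per group.
for d in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"standardized difference {d:.1f}: power = {power_two_sided(d, 50):.3f}")
```

At a difference of zero the function returns α, and it rises toward 1 as the stipulated difference grows, so the reported power depends entirely on which effect size is chosen for the research hypothesis. With these assumed inputs the power at a standardized difference of 0.5 is about 0.71, close to but not identical to the value quoted for Fig. 2, whose sample size is not stated in the text.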

2.2 Importance of Conducting Well-Powered Studies

There is an extensive body of work spanning both the medical and
statistical literatures arguing for the importance of conducting
well-powered studies.

Rationale for conducting well-powered studies includes: 

(a) Use of resources. The conduct of medical research requires 
expenditure of investigator time and financial resources, which 
generally come directly or indirectly from funding provided by 
governments, philanthropic organizations, or profits derived 
from medical care. Conduct of underpowered studies has been 
criticized for expending investigator time and funding on 
studies of low value, thus diverting limited resources from 
more useful research [7]. 

(b) Ethics in relation to study participants. Multiple studies have
shown that individuals often participate in medical research for
altruistic motives, to contribute to the advance of medical
science and help future patients [8-10]. Because underpowered
studies may not adequately test the study hypothesis, they have
been considered of limited scientific value and therefore
unethical in their exposure of subjects to the risks and burdens
of medical research [10-12] without the benefit of
significantly adding to medical knowledge.

Fig. 3 Positive and negative predictive value and power (panels: positive predictive value; negative predictive value). Shown are the relationships of positive and negative
predictive value to statistical power assuming R values of 1 (black), 0.333 (red), and 0.10 (green), respectively.
Both positive and negative predictive values are reduced when power is low

(c) Effects of low power on positive and negative predictive values of 
study findings. Recently there has been increased emphasis on 
the implications of statistical power for the positive and negative 
predictive values of results published by research programs [13- 
15]. The risk that a publication by a research program is errone¬ 
ous is defined either by the probability of a false positive finding 
given a positive reported result or by the probability of a false 
negative finding given a negative reported result. These error 
probabilities can be interpreted as 1 minus the positive and
negative predictive values of the study results, respectively. In a
simplified model where only the two possible treatment effects
corresponding to the null and research hypotheses are
considered, the positive and negative predictive values are readily
calculated from α and β along with the ratio, usually denoted R, of
the proportion of the research program’s studies with true
research hypotheses versus the proportion of the program’s
studies with true null hypotheses [13]. Figure 3 displays the
positive and negative predictive value as a function of power for
R = 1/16 and R = 1 when α = 0.05. Low power leads to
substantially reduced positive and negative predictive value at each value
of R, implying that both the positive and negative findings
reported by the research program may have high risks of error.
Of note, when power is low, the risk that a finding reported as
positive is actually a false positive result can be substantially
greater than the α-level of 0.05. Positive predictive value is also
reduced when R is low, a point we expand on below.
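
The dependence of predictive value on power in item (c) can be illustrated with a short calculation. The sketch below applies Bayes' rule to the simplified two-hypothesis model described above; the function name `predictive_values` and the particular power values printed are illustrative assumptions, not taken from the source.

```python
def predictive_values(power, R, alpha=0.05):
    """Return (PPV, NPV) under the simplified two-hypothesis model.

    R is the ratio of studies with a true research hypothesis to studies
    with a true null hypothesis; beta = 1 - power. Frequencies are expressed
    per study with a true null hypothesis, so the common scaling cancels.
    """
    beta = 1.0 - power
    true_pos = power * R      # research hypothesis true, null rejected
    false_pos = alpha         # null true, null rejected
    true_neg = 1.0 - alpha    # null true, null not rejected
    false_neg = beta * R      # research hypothesis true, null not rejected
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

for R in (1.0, 1 / 16):
    for power in (0.2, 0.5, 0.8, 0.95):
        ppv, npv = predictive_values(power, R)
        print(f"R = {R:.4f}, power = {power:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

For example, with R = 1/16 and power of 0.20 the positive predictive value is 0.20, so four of every five positive findings would be false positives even though α = 0.05.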

The conduct of underpowered studies may be accompa¬ 
nied by additional shortcomings which can exacerbate the 
above difficulties. These include: 

(d) Publication bias. When an underpowered study produces a 
negative outcome, the precision of the study will typically be 
insufficient to rule out a clinically important effect that went 
undetected due to low power, emphasizing the inability of the 
study design to answer the research question. As a consequence 
either of this limitation, or a general lack of enthusiasm for 
negative findings, negative findings from underpowered studies 
may not be submitted for publication, or, if submitted, they 
may be less likely to survive the review process. This phenom¬ 
enon, which has been demonstrated in multiple areas of inves¬ 
tigation using meta-analytic techniques [16, 17], can lead to a 
distorted body of evidence in the medical literature which 
falsely suggests a beneficial effect of an ineffective treatment or 
which exaggerates the benefit of a treatment with a small effect. 

(e) Failure to report confidence intervals. The importance of
presenting confidence intervals to indicate the precision of
estimated treatment effects has been widely emphasized [18, 19].
In particular, in the absence of a confidence interval, a negative 
finding may be interpreted incorrectly as demonstrating the 
absence of an effect when presentation of a confidence interval 
would show that the results are compatible both with no effect 
and with a clinically important effect. 

(f) Failure to disclose low power. Medical journals may be reluctant
to publish studies where the power of the primary analysis is
reported to be less than 80 %. This often results in a sort of
dance, where researchers modify the parameter values,
including the targeted treatment effect under the research
hypothesis, until the calculated power is 80 % or higher [20].
In fact, as demonstrated in Fig. 2, regardless of the study’s 
sample size it is almost always possible to achieve any desired 
statistical power simply by “hypothesizing” a sufficiently large 
effect. Unfortunately, if the sample size is selected to reflect a 
hypothesized effect which is not biologically plausible, the 
consequences are analogous to reducing the ratio R in Fig. 3, 
leading both to poor positive predictive value and to a low true 
power. This practice has been viewed as leading to distorted 
presentations of the evidence to the research community 
and as misleading trial participants into believing that they are
participating in a well-powered study when in fact the study is
unable to come to a clear conclusion [12]. 

2.3 Are Underpowered Studies Ever Justified?

Because there are practical barriers for the conduct of
well-powered trials, a number of authors have argued that efforts to
prohibit underpowered trials may thwart many investigations,
limiting the breadth of medical research [20, 21]. Under this
view, underpowered trials may be deemed of value as long as 
they provide confidence intervals to display the precision of their 
results and as long as they are published irrespective of whether 
their results are positive or negative so that they can be sub¬ 
sumed into subsequent meta-analyses in an unbiased fashion. 
This view is encouraged by progress in registering trials in 
national databases, mitigating publication bias [22]. Others
have argued that methodological limitations of meta-analyses 
limit the utility of this approach, and that presentation of confi¬ 
dence intervals, while important, cannot overcome the fact that 
findings are uninformative if the confidence intervals are so wide 
as to include both the null hypothesis and clinically important 
effects [12, 23]. Under the latter view, underpowered studies 
may be considered ethical only in special cases, such as rare dis¬ 
eases where larger trials are infeasible, or in pilot studies where 
the primary goals involve feasibility assessments distinct from the 
determination of the treatment effect on outcomes. An exten¬ 
sion of this perspective would hold that well-conducted and 
properly reported underpowered trials may be acceptable if alter¬ 
native design and analysis options have been explored and found 
to be infeasible or inadequate. Recently, efforts have been made 
to organize prospective meta-analyses of separate studies in 
which investigators adhere to standardized treatment compari¬ 
sons and primary outcomes to support a valid meta-analysis of 
the primary outcome, while allowing variations among partici¬ 
pating centers in secondary endpoints and other noncentral 
aspects of the protocol [24, 25]. 


3 Basics of Sample Size Calculation 

This section illustrates the mechanics of sample size calculation for 
three types of outcomes: (1) continuous outcomes such as blood 
pressure or serum lipid levels, (2) binary outcomes such as success
or failure, and (3) survival outcomes defined by the time to the 
occurrence of a clinical event. The statistical literature presenting 
sample size calculations for different settings is vast, and we will 
not attempt a complete overview but rather illustrate the main
features of sample size calculation with these core examples. Detailed
presentations of sample size calculations are given in texts by Chow
et al. [26], Cohen [27], and Machin et al. [28].

3.1 Mechanics of Sample Size Calculation

The second column of Table 1 presents standard formulae relat¬ 
ing the required sample size in each group of a two-group study 
for two-sided tests with equal sample sizes per group to the fol¬ 
lowing quantities: 

1. The initial multiplying factor 2 accounts for the fact that
the comparison of the outcome between the treatment and
control groups involves a comparison of two random quantities.

2. The second term, expressed either as $(t_{\alpha/2} + t_{\beta})^2$ or $(z_{\alpha/2} + z_{\beta})^2$,
determines the effect of the type 1 error α and power 1 − β on
the required sample size. Here $t_{\alpha/2}$ and $t_{\beta}$ represent quantiles
from the t-distribution while $z_{\alpha/2}$ and $z_{\beta}$ are quantiles from the
normal distribution. The type 1 error α is divided by 2 to
account for the designation of a two-sided hypothesis test; if a
one-sided test is performed, $t_{\alpha/2}$ and $z_{\alpha/2}$ are replaced by $t_{\alpha}$ and
$z_{\alpha}$, respectively. When the sample size is large, $t_{\alpha/2}$ and $t_{\beta}$ are
approximately equal to $z_{\alpha/2}$ and $z_{\beta}$, but $t_{\alpha/2}$ and $t_{\beta}$ are slightly
larger than $z_{\alpha/2}$ and $z_{\beta}$ for small sample sizes to account for the
uncertainty in estimating the standard deviation. Typical values
of $z_{\alpha/2}$ vary between 1.96 for α = 0.05 and 2.58 for α = 0.01,
and $z_{\beta}$ varies between 0.84 for 80 % power and 1.28 for 90 %
power. It follows that using α = 0.01 instead of α = 0.05 requires
an approximately 49 % larger sample size when power is 80 % and
a 42 % larger sample size when power is 90 %. Similarly, using
90 % instead of 80 % power requires an approximate 34 %
increase in sample size when α = 0.05 and a 27 % increase when
α = 0.01. In general, the clinical trials literature advocates the
use of two-sided tests over one-sided tests, in part to
acknowledge the possibility of an adverse effect of the treatment and in
part to maintain consistency of reporting between different
clinical trials [2]. When one-sided tests are performed, it is
generally recommended that the significance level be set to one-half
the value that would have been used for a two-sided test; this
assures that the investigator’s decision of whether to use a one-
or two-sided test does not change the criteria for concluding a
statistically significant effect.

3. The third term, $1/\delta^2$, represents the inverse of the square of the
hypothesized effect size δ. For continuous outcomes, the effect
size is defined as the difference $\mu_T - \mu_C$ in the mean of the
outcome between the treatment and control groups under the
research hypothesis. Similarly, for binary outcomes, the effect
size is defined as the hypothesized difference $\pi_T - \pi_C$ in the
proportions between the treatment and control groups. In
survival analysis, the effect size is given by the logarithm of the
ratio of the hazard rates between the treatment and control
groups under the research hypothesis. The fact that the effect
size is squared in the expression for the required N has a
profound impact on the practice of sample size calculation, as even
relatively modest differences in the hypothesized effect size can
have a very large effect on the required sample size. For
example, suppose that 50 % of subjects are expected to fail in the
control group in a study with a binary outcome. Then,
hypothesizing that the treatment will lead to a 20 % instead of a 30 %
relative reduction in the failure rate would require more than a
doubling in the required sample size.

4. The final term, which varies between each type of outcome, 
represents the variability of the outcome variable for a single 
patient, and in statistical theory represents the inverse of the 
amount of information contributed by a single patient to the 
analysis. Consideration of this variability term accounts for the 
characteristics of the study population and allows investigators 
to weigh alternative options for selection of the primary out¬ 
come and research design. 
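
The four components listed above can be combined in a few lines of code. The following Python sketch is an illustration under stated assumptions, not a reproduction of Table 1: it uses normal quantiles (the large-sample form), equal group sizes, and a two-sided test, and the function names `n_per_group` and `n_binary` are introduced here for illustration only.

```python
import math
from statistics import NormalDist

def n_per_group(delta, variability, alpha=0.05, power=0.80):
    """Per-group sample size: 2 * (z_{alpha/2} + z_beta)^2 * variability / delta^2.

    `variability` is the per-patient variability term, for example sigma^2
    for a continuous outcome or pi * (1 - pi) for a binary outcome.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * variability / delta ** 2

def n_binary(pi_c, pi_t, alpha=0.05, power=0.80):
    """Per-group sample size for a binary outcome, using the average proportion."""
    pi_bar = (pi_c + pi_t) / 2
    return n_per_group(pi_t - pi_c, pi_bar * (1 - pi_bar), alpha, power)

# Continuous outcome, effect of one-half standard deviation (sigma = 1).
print(math.ceil(n_per_group(0.5, 1.0)))
# Relative increase in N when alpha = 0.01 replaces alpha = 0.05 (80 % power).
print(n_per_group(0.5, 1.0, alpha=0.01) / n_per_group(0.5, 1.0, alpha=0.05))
# 50 % control failure rate: 20 % versus 30 % relative reduction.
print(n_binary(0.50, 0.40) / n_binary(0.50, 0.35))
```

The second and third printed ratios correspond to the roughly 49 % increase in sample size and the more-than-doubling described in items 2 and 3. For small samples the t-quantiles would give slightly larger values than this normal approximation.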

For a continuous outcome, the variability of the outcome for a 
single patient is defined by the square of the standard deviation of 
the outcome in the study population. In many cases, variability can 
be reduced (and information per patient increased) by analyzing 
the change in a continuous variable from baseline, thereby
controlling for inter-patient variation at the baseline assessment. For
analyses of changes between a baseline and follow-up assessment, the
variability per patient is expressed as $2\sigma^2(1 - R)$, where R
represents the correlation of the outcome between the baseline and
follow-up assessments. Thus, if R = 0.80, analysis of change from 
baseline will reduce the variability term, and hence the required 
sample size, by 60 % relative to an analysis of the outcome without 
subtracting the baseline level. It is important to note, however, 
that analysis of changes from baseline leads to increased variability 
if R is less than 0.50; if R = 0, analysis of change scores would 
double the required sample size compared to analysis of the fol¬ 
low-up outcome variables without consideration of the baseline 
values. This risk of a loss of power when using change scores can 
be eliminated by using an analysis of covariance (ANCOVA) model 
in which the changes from baseline to follow-up are compared 
between the treatment and control groups after controlling for the 
baseline levels. The variability term becomes $\sigma^2(1 - R^2)$, which is
always equal to or smaller than the variability terms both for the 
analysis of the follow-up value alone and for the analysis of change 
scores [29]. Compared to analysis of change scores without adjust¬ 
ment for the baseline levels, analysis of covariance provides a 50 %, 
37.5 %, 25 %, or 12.5 % lower required sample size if R = 0, 0.25, 
0.50, or 0.75, respectively. 
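
The comparison among the three analyses of a continuous outcome can be checked directly from the variability terms given above. This is a minimal Python sketch; the function name `variability_terms` and the dictionary labels are illustrative choices, not notation from the source.

```python
def variability_terms(sigma, R):
    """Per-patient variability for three analyses of a continuous outcome.

    sigma is the outcome standard deviation; R is the correlation between
    the baseline and follow-up assessments.
    """
    return {
        "follow-up only": sigma ** 2,
        "change score": 2 * sigma ** 2 * (1 - R),
        "ANCOVA": sigma ** 2 * (1 - R ** 2),
    }

for R in (0.0, 0.25, 0.50, 0.75, 0.80):
    v = variability_terms(1.0, R)
    saving = 1 - v["ANCOVA"] / v["change score"]
    print(f"R = {R:.2f}: change score = {v['change score']:.3f}, "
          f"ANCOVA = {v['ANCOVA']:.4f}, ANCOVA saving vs change score = {saving:.1%}")
```

Because the required sample size is proportional to the variability term, the ratio of two terms is also the ratio of the corresponding sample sizes. The ANCOVA term never exceeds either of the other two for any correlation between 0 and 1.
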
Outcomes are often measured repeatedly at multiple follow-up
times. This provides the investigators with numerous analysis
options for contrasting the outcome between the treatment and
control group, including, for example, comparisons of the
outcome at the final scheduled visit, the average outcome value across
all follow-up visits, or comparisons of the mean slope over the
baseline and follow-up periods under the assumption of a linear
model for change over time. If the follow-up measurements are
equally spaced in time, the variability term is given by

$$\sigma_B^2 + \frac{12(P - 1)\,\sigma_{error}^2}{D^2 P(P + 1)},$$

where $\sigma_B$ represents the standard deviation
of the true underlying slopes, $\sigma_{error}$ represents the standard
deviation of the residuals of the outcome measurements from the
underlying linear trajectories, D represents the total duration of follow-up,
and P represents the number of measurements. This expression
allows investigators to assess the impact of use of different
follow-up times and measurement frequencies on the required sample size
[30, 31].

For binary outcomes, the variability term is given by $\pi(1 - \pi)$,
where π represents the average probability of the outcome across
the treatment and control groups under the research hypothesis.
For time-to-event analyses, the information per patient is
approximated by the proportion of enrolled patients who experience
events during the follow-up period of the study. This proportion is
determined by the event rate in the study population in
conjunction with the durations of the planned enrollment and subsequent
follow-up periods of the trial [32].


3.2 Determining Study Population Input Parameters

The practical challenge for calculation of sample size is the deter¬ 
mination of the effect size and the study population input param¬ 
eters listed in the final column of Table 1. We consider these issues 
during the remainder of this section and the following section. 

The ideal approach to estimating the study population input 
parameters is to perform a systematic review of past research pro¬ 
viding summary statistics for similar populations and then pool 
estimates of the input parameters across these studies. Importantly, 
it is not necessary to identify studies with interventions similar to 
the intervention of the proposed study for this exercise. Estimates 
of standard deviations for continuous outcomes may reasonably 
consider standard deviation estimates from either treated or con¬ 
trol patients in past studies, while estimates of the control group 
proportion $\pi_C$ or the control group hazard rate should ideally be
obtained from populations treated similarly to the control arm of 
the planned study. It is important to avoid two common method¬ 
ological errors during this process. First, in order to obtain a 
smaller required sample size, investigators may be tempted to scan 
the literature for studies that report favorable standard deviations,
outcome proportions, or hazard rates rather than perform an
unbiased review. This practice will lead to falsely optimistic estimates of
the input parameters compared to those which are observed in the 
actual study, leading to failure to achieve the desired power. 
Second, estimates of the population input parameters may be 
imprecise if obtained from studies with small sample sizes. This 
issue is especially acute for estimates of variability. For example, 
with a sample size of 5, the upper endpoints of 75 and 90 % upper 
one-sided confidence intervals for the standard deviation exceed 
the reported standard deviation by 44.2 and 93.9 %, respectively, 
reflecting 2.08 and 3.76-fold differences in the required sample 
size. With a sample size of 10, the upper endpoints of 75 and 90 % 
upper one-sided confidence intervals are 23.5 and 46.9 % higher 
than the reported standard deviation, corresponding to 1.53- and 
2.16-fold differences in the required sample size. Due to this 
uncertainty, it can be advisable to use a 75th or higher upper con¬ 
fidence limit for the standard deviation when small studies are used 
to obtain the input parameters for sample size calculations or else 
to formally account for the uncertainty of the estimated standard 
deviation in the calculation [33]. A third complication arises when 
considering outcomes defined by adverse events such as mortality. 
Because the risk of some adverse outcomes is declining over time, 
event rates from earlier studies may overestimate the risk of the 
adverse outcome in the control group of the study being designed, 
leading to underestimation of the required sample size. In
addition, patients who enroll in clinical trials often have lower rates of
adverse outcomes than observed in the general population; this
should be taken into account when applying adverse event rates
taken from general population or other observational studies
when designing a clinical trial. In general, if there is uncertainty
regarding an input parameter, experience has shown that it is pru¬ 
dent to err on the side of considering less favorable values for the 
parameter as inputs for sample size calculation. 
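
The imprecision of a standard deviation estimated from a small study can be quantified with a one-sided upper confidence limit. The Python sketch below assumes a normally distributed outcome, so that the sample variance follows a scaled chi-square distribution, and computes the chi-square quantile with a simple series and bisection to stay within the standard library. The helper names `chi2_cdf`, `chi2_quantile`, and `sd_upper_limit_ratio` are illustrative.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert the CDF by bisection (adequate for a planning calculation)."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sd_upper_limit_ratio(n, confidence):
    """Ratio of the one-sided upper confidence limit for sigma to the observed SD."""
    df = n - 1
    return math.sqrt(df / chi2_quantile(1.0 - confidence, df))

for n in (5, 10):
    for confidence in (0.75, 0.90):
        ratio = sd_upper_limit_ratio(n, confidence)
        print(f"n = {n}, {confidence:.0%} limit: SD x {ratio:.3f}, required N x {ratio ** 2:.2f}")
```

Since the required sample size scales with the square of the standard deviation, the squared ratio gives the fold difference in sample size implied by using the upper limit instead of the reported value.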

3.3 Pilot Studies

Some of the above issues can be addressed by performing either
external or internal pilot studies. Under this approach, a
preliminary sample size target is set based on information from other
studies, but the estimated sample size may be adjusted based on 
estimates of variability or of event rates observed in a pilot study 
whose entry criteria, outcome measurements, and other protocol 
elements are formulated to match those of the full-scale trial [3, 
34, 35]. The pilot is termed “external” if it is conducted prior to 
the start of the full-scale trial and the outcome results of the pilot 
are not included in the analysis of the full-scale trial, and “internal” if
its results are included in the analysis of the full-scale trial. If the 
results of the internal or external pilot are used only to refine
estimates of variability or group event rates, and not estimates of the
treatment effect, adjustments to the sample size of the full-scale
trial can be implemented with little or no adjustment of the
significance level [2].


4 Choosing the Effect Size 

Three criteria for choosing the hypothesized effect size have been 
discussed for sample size calculation: (1) clinical importance, (2) 
biologic plausibility, and (3) published criteria for “small,” 
“medium,” and “large” effects. All three of the criteria involve a 
certain degree of subjective judgment. 

4.1 Minimum Clinically Important Effect

A trial should ideally be designed to detect the “minimum
clinically important effect” with high power, thus limiting the risk that
an important treatment effect is missed due to a type 2 error. For
outcomes defined by clinical events, the minimum clinically
important effect size can be determined by having experts assess the smallest
reduction in the event rate felt to justify application of the
intervention. This assessment can be aided by computing the number
needed to treat, defined as the number of patients who would need
to be treated to prevent one event. For binary outcomes, this is
$1/(\pi_C - \pi_T)$. Judgments of the minimum clinically important
effect may also depend on the prevalence of the condition being
treated, as the total number of events that could be prevented by
an intervention is the ratio of the size of the targeted clinical
population and the number needed to treat; for binary outcomes, this is
$N_{population} / [1/(\pi_C - \pi_T)] = N_{population}(\pi_C - \pi_T)$. For very large patient populations, even
treatments leading to relatively small $\pi_C - \pi_T$ may prevent large
numbers of adverse outcomes. The size of the minimum clinically
important effect may also be informed by the side effect burden of 
the treatment, with larger effect sizes needed to warrant applica¬ 
tion of the treatment for treatments with more severe side effects. 
For quality of life measures, the minimum clinically important 
effect has been linked to the minimum clinically important differ¬ 
ence, which has been evaluated for many common measures [36]. 
For biomarker measurements, empirical relationships from previ¬ 
ous studies may be used to link differences in the biomarker to 
differences in clinical outcome, allowing the investigators to iden¬ 
tify the difference in the biomarker outcome which is associated 
with a minimum clinically important difference in the clinical out¬ 
come. For example, if a 5 mmHg reduction in blood pressure has 
been found to be associated with a 15 % relative hazard reduction 
of adverse cardiovascular events in a population, and a 15 % relative
hazard reduction is viewed as a clinically important effect, then a
5 mmHg blood pressure reduction may also be interpreted as clini¬ 
cally important for the purpose of power calculations. 
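
The number needed to treat and the corresponding number of preventable events are simple to compute for a binary outcome. The sketch below uses hypothetical event rates and a hypothetical population size chosen only for illustration; the function names are likewise illustrative.

```python
def number_needed_to_treat(pi_c, pi_t):
    """Patients who must be treated to prevent one event (binary outcome)."""
    return 1.0 / (pi_c - pi_t)

def events_prevented(population_size, pi_c, pi_t):
    """Targeted population size divided by the number needed to treat."""
    return population_size / number_needed_to_treat(pi_c, pi_t)

# Hypothetical inputs: 20 % control event rate, 18 % treated event rate,
# and a targeted clinical population of one million patients.
print(number_needed_to_treat(0.20, 0.18))
print(events_prevented(1_000_000, 0.20, 0.18))
```

Even a two-percentage-point absolute reduction, which implies treating about 50 patients to prevent one event, would prevent on the order of 20,000 events in a population of this assumed size.
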
4.2 Biologically Plausible Effect

The biologically plausible effect size is assessed primarily by
considering the magnitudes of treatment effects observed for similar
interventions on similar outcomes in previous studies. Ideally, a meta-analysis
may be carried out based on previous studies of treatments in the
same class of treatments to provide a provisional estimate of the
treatment effect to be tested in a new trial. In many cases, the treatment
class under investigation will not have been tested in a previous study;
in this situation, investigators may consider the average effect size
observed in studies of other treatments having comparable evidence
in support of an effect at the time those treatments were evaluated.
Two major cautions are warranted when evaluating the range of
biologically plausible effect sizes: First, due to publication bias, estimates
of treatment effects from early studies in a particular area of
investigation may be skewed to overestimate the true treatment effect,
particularly if these studies are small. Second, small studies, including
internal or external pilot studies, generally do not provide useful
estimates of effect size for power calculations. Such studies are not
powered to determine the effect of the treatment on outcome;
consequently, confidence intervals for the treatment effect from these
studies typically include both no effect and very large effects and are
consistent with required sample sizes ranging from a small number to
positive infinity.

4.3 Standardized Effect Size Conventions

The third approach to determining the effect size is to rely on rules
of thumb which have been developed in the literature. The most
notable of these are Cohen’s “small,” “medium,” and “large”
effect sizes for continuous outcomes, defined in terms of the ratio
of the hypothesized mean difference to the standard deviation in
the study population. Cohen termed effect sizes standardized in
this way as small if they are between 0.2 and 0.3, medium if they
are approximately 0.5, and large if they exceed 0.8. While arbitrary,
the criterion that studies should usually be powered to detect
medium effects has been widely used in the literature [27]. Due
to the arbitrariness of this approach, the use of such criteria for
sample size calculation is generally recommended for medical
studies only when absence of information renders assessment of the
minimum clinically important effect and the range of biologically
plausible effects too difficult [12].

4.4 Noninferiority Trials

The preceding discussion has been framed in the context where the
research objective is to determine the superiority of one treatment
with respect to a control. In some cases, the research objective is to
determine if a new candidate intervention, which may reduce cost
or lead to fewer side effects, is at least as effective as an existing
standard intervention. The fundamental challenge in
noninferiority trials is that it is not possible to demonstrate exact equivalence
between interventions. Hence, sample size calculations in
noninferiority designs are based on a “noninferiority margin” δ selected so
that treatment effects smaller than δ can be regarded as “clinically
equivalent” [37, 38]. In general, the noninferiority margin should
be no larger than the minimum clinically important effect.


5 Accounting for Loss to Follow-Up and Nonadherence 

5.1 Loss to Follow-Up

Documented procedures to minimize loss to follow-up are an
essential element of well-designed clinical trials and cohort studies.
Even so, it is usually the case that some patients are lost to
follow-up prior to the ascertainment of the primary outcome. Hence, for
outcomes evaluated at a single follow-up time, it is standard to
inflate the sample size of the study design by a factor $1/(1 - P_{loss})$,
where $P_{loss}$ is the proportion of patients projected to be lost to
follow-up prior to the outcome assessment. For example, if the
required sample size is 1,000 patients but 15 % are expected to be
lost to follow-up, the target sample size would be set to
1,000/(1 − 0.15) = 1,177. Adjustment for loss to follow-up is more
complex for time-to-event outcomes, as patients who are lost during
the follow-up period will contribute partial information to the 
analysis for the period prior to the time the patient is censored. 
Hence, standard sample size software for time-to-event analyses 
requires the user to provide rates of loss to follow-up over the 
course of the study. Similar considerations apply to longitudinal 
outcomes such as slopes or the mean value of the outcome through¬ 
out the follow-up period [39]. 
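
The single-time-point adjustment can be checked with a few lines of code. The following is a minimal sketch; the function name `inflate_for_loss` is ours and purely illustrative, and rounding up to the next whole patient is an assumption consistent with the worked example.

```python
import math


def inflate_for_loss(n_required: int, p_loss: float) -> int:
    """Inflate a required sample size by 1 / (1 - p_loss), rounding up."""
    if not 0.0 <= p_loss < 1.0:
        raise ValueError("p_loss must be in [0, 1)")
    return math.ceil(n_required / (1.0 - p_loss))


# Worked example from the text: 1,000 patients required, 15 % expected loss.
print(inflate_for_loss(1000, 0.15))  # 1177
```

This adjustment applies only to outcomes assessed at a single follow-up time; it does not capture the partial information contributed by censored patients in time-to-event analyses.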

5.2 Nonadherence

Statistical power is also reduced when a subset of patients assigned
to the treatment group discontinues the treatment (treatment
dropouts) or when a subset of patients assigned to the control
group starts the treatment (treatment drop-ins). While the
occurrence of treatment discontinuation superficially resembles loss to
follow-up, the implications of these two processes are
fundamentally distinct for the study design and sample size calculation. In
accordance with the principle of intention-to-treat, it is essential
that all efforts be made to continue to collect outcome information
after treatment discontinuation. Since patients who discontinue
treatment are likely to have different clinical characteristics than
patients who remain on treatment, exclusion of these patients
would often lead to important differences in the characteristics of
the treatment and control group patients retained in the primary
analysis, negating the purpose of the randomized design.

Table 2
Impact of nonadherence on required sample size

Treatment    Treatment    Overall       Hypothesized  Hypothesized     Fold increase in
dropout (%)  drop-in (%)  nonadherent   biological    intent-to-treat  required N due to
                          (%)           effect (%)    effect (%)       nonadherence

10           0            5             10            9                1.25
20           0            10            10            8                1.60
30           0            15            10            7                2.12
40           0            20            10            6                2.92
50           0            25            10            5                4.25
10           10           10            10            8                1.56
20           20           20            10            6                2.78
30           30           30            10            4                6.25
40           40           40            10            2                25.0
50           50           50            10            0                Infinite

Shown are increases in required sample size with varying treatment dropout and drop-in rates for a binary outcome in
an RCT with equal allocation to a treatment and control group, where the research hypothesis stipulated an outcome
probability of 0.30 under the control and 0.20 under the treatment

The consequences of treatment drop-ins and dropouts for
statistical power are substantial and generally greater than the
consequences of comparable rates of loss to follow-up. This can be
illustrated most easily for a study comparing a binary outcome
between a treatment and control, assuming that the probability of
the outcome for treatment dropouts reverts to the outcome
probability for patients consistently taking the control and that the
outcome probability for treatment drop-ins from the control group
assumes the outcome probability for patients consistently taking
the treatment. As shown in Table 2, nonadherence in the form of
treatment dropouts and drop-ins reduces the treatment effect
under the intent-to-treat analysis relative to the biological effect of
the treatment with 100 % adherence, which in turn inflates the
required sample size. Because of the inverse square relationship
between required sample size and the treatment effect, even
moderate reductions in the size of the intent-to-treat treatment effect
can greatly increase the required sample size; for example, a 20 %
treatment dropout rate in conjunction with a 20 % treatment
drop-in rate results in a 2.78-fold increase in required sample size. In
comparison, a 20 % loss to follow-up in both the treatment and
control groups would require a 1.25-fold increase in the required
sample size. When considering nonadherence, power calculations
are sometimes expressed in terms of the hypothesized biological
effect that would occur with perfect adherence and
sometimes in terms of the intent-to-treat effect after accounting
for nonadherence. The ultimate required sample size is the same in
either case, as long as the intent-to-treat effect is appropriately
discounted to account for the projected effect of nonadherence. We
have found that it is useful for investigators to consider both the
hypothesized biological effect and the intent-to-treat effect to
clarify the implications of nonadherence. Analogous to the loss to
follow-up situation, accounting for treatment drop-ins and
dropouts is somewhat more complex in time-to-event analysis, as the
expected effect size for patients who drop out or drop in at a
particular follow-up time is usually assumed to match the full
biological effect of the treatment until that time. Hence, sample size
calculations for time-to-event outcomes also require assumptions
regarding the rates of dropouts and drop-ins over successive
follow-up intervals [40, 41].
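
The fold increases in Table 2 can be reproduced with a short calculation. The sketch below is a minimal illustration under stated assumptions: it uses a normal-approximation sample size that is proportional to 2p(1 - p)/d^2, where p is the average of the two outcome probabilities and d is their difference (a pooled-variance form that we assume here because it matches the tabulated values), and the function name `fold_increase` is hypothetical.

```python
import math


def fold_increase(p_control, p_treat, dropout, drop_in):
    """Fold increase in required N caused by nonadherence.

    Dropouts from the treatment arm are assumed to revert to the control
    outcome probability; drop-ins from the control arm are assumed to take
    on the treatment outcome probability.
    """
    def relative_n(p_c, p_t):
        # Pooled-variance approximation; constants cancel in the ratio.
        p_bar = (p_c + p_t) / 2.0
        return 2.0 * p_bar * (1.0 - p_bar) / (p_c - p_t) ** 2

    itt_treat = (1.0 - dropout) * p_treat + dropout * p_control
    itt_control = (1.0 - drop_in) * p_control + drop_in * p_treat
    if abs(itt_control - itt_treat) < 1e-12:
        return math.inf  # no intent-to-treat effect remains
    return relative_n(itt_control, itt_treat) / relative_n(p_control, p_treat)


# Outcome probabilities 0.30 (control) and 0.20 (treatment), as in Table 2.
for dropout, drop_in in [(0.10, 0.0), (0.20, 0.0), (0.20, 0.20), (0.40, 0.40)]:
    print(dropout, drop_in, round(fold_increase(0.30, 0.20, dropout, drop_in), 2))
```

Because the significance level and power multiply both numerator and denominator, they cancel, so the fold increase depends only on the outcome probabilities and the nonadherence rates.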

5.3 Pragmatic Trials

Due to the severe effect of nonadherence on power and required
sample size, it has historically been a tenet of clinical trial design
that efforts should be made to identify and exclude patients at risk
for nonadherence prior to randomization and to include adherence
promotion efforts after randomization. Recently, there has been
increased concern that this approach may compromise the external
generalizability of the trial results to clinical practice, and there has
been a trend towards so-called pragmatic trials in which broader
patient populations are enrolled irrespective of the risk of
nonadherence [42]. Clearly, it is important that realistic projections of
nonadherence be incorporated into pragmatic trial designs, with
consequent upward adjustments to sample size.


6 Options When the Initial Calculated Sample Size Is Low 

As described in Subheading 3, investigators may inflate
hypothesized treatment effects to provide the appearance of a well-powered
trial when power calculations indicate an infeasible required effect
size, with adverse consequences for medical research. Table 3
summarizes more appropriate options when the initial required sample
size appears to be infeasible. The first seven settings as well as 8a and
9a represent scenarios in which the primary outcome can be
modified to increase statistical information provided per patient, thus
reducing the number of patients required to achieve adequate power.
For example, if a binary outcome is defined by the occurrence of a
normally distributed biomarker value less than a designated
threshold, then redefining the primary outcome as the biomarker itself,
without dichotomization, typically reduces the required sample size
by 30-60 %, depending on the threshold and size of the treatment
effect. Basing the treatment group comparison on an average of
repeated measurements of the primary outcome can lead to
substantial reductions in required sample size if the longitudinal variability
within the same patients is substantial compared to the variation of
the outcome between patients. For example, if 50 % of the total
variance of the outcome is due to within-patient variability, then
averaging 2, 3, or 4 repeated measurements would reduce the required
sample size by approximately 25, 33, and 37.5 %, respectively.
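
The arithmetic behind these percentages can be sketched as follows, assuming that within-patient deviations are independent across the repeated measurements, so that averaging k measurements divides the within-patient variance by k while leaving the between-patient variance unchanged. The function name is illustrative.

```python
def relative_n_with_averaging(k: int, within_fraction: float) -> float:
    """Required N with k averaged measurements, relative to one measurement.

    within_fraction is the share of total outcome variance that is due to
    within-patient variability.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (1.0 - within_fraction) + within_fraction / k


# 50 % of the total variance is within-patient, as in the example above.
for k in (2, 3, 4):
    reduction = 1.0 - relative_n_with_averaging(k, 0.50)
    print(k, f"{100 * reduction:.1f} %")
```

The reduction can never exceed the within-patient share of the variance, which is why the gains level off quickly as further measurements are added.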
Table 3
Design options when calculated required sample size is infeasible

Setting                                         Design option to reduce required sample size

1   Dichotomized continuous outcome             (a) Revert to original continuous outcome

2   Binary clinical endpoint with high          (a) Perform time-to-event analysis
    probability of occurrence

3   Time-to-event or slope-based analysis       (a) Extend follow-up period

4   Single follow-up assessment of              (a) Employ analysis of covariance
    continuous outcome                          (b) Base analysis on multiple follow-up
                                                    measurements

5   Imprecise outcome                           (a) Employ outcome with improved precision

6   Non-normal continuous outcome               (a) Employ transformations to better
                                                    approximate normality
                                                (b) Employ statistical model for non-normal
                                                    outcome
                                                (c) Employ robust statistical methods

7   Highly prognostic covariate available       (a) Employ covariate adjustment or stratified
                                                    analysis

8   Rare dichotomous endpoint (binary and       (a) Use composite of 2 or more events
    time-to-event outcomes)                     (b) Restrict analysis to “enriched” population
                                                    with higher event rate

9   Small hypothesized effect size              (a) Consider alternative biomarker endpoints
                                                    more proximate in the causal pathway to the
                                                    treatment
                                                (b) Restrict study subpopulation with larger
                                                    hypothesized effect size
                                                (c) Conduct explanatory trial with intensive
                                                    adherence promotion

10  Contamination of control group with         (a) Conduct cluster randomized design
    treatment

11  Treatment effect rapidly attenuates         (a) Employ cross-over design
    after discontinuation

12  Initial stages of investigation             (a) Revert to pilot study addressing feasibility
                                                    issues

13  High per-patient cost of protocol           (a) Simplify protocol to focus on primary
                                                    outcome

14  Multiple treatment arms                     (a) Simplify to 2 treatments
                                                (b) Organize treatment into factorial design

15  Widespread interest in treatment            (a) Organize or participate in multicenter trial
                                                (b) Organize or participate in prospective
                                                    meta-analysis

Because the information per patient in time-to-event analyses is
given by the number of events (Table 1), composite endpoints
defined by more than one type of event are often used to reduce the
number of patients who must be recruited to observe the required
number of events [2, 43]. For example, a study of cardiovascular
disease might consider a composite endpoint based on the first
occurrence either of death from coronary heart disease or nonfatal
myocardial infarction. The savings in required sample size can be substantial;
if two types of events are each projected to occur for 5 % of subjects
during the planned follow-up period, a composite defined by the first
occurrence of either event may reduce the required sample size by
nearly 50 %. The use of composite endpoints is most widely
recommended when each type of event in the composite reflects the same
underlying condition or can be expected to respond to the same
underlying mechanism of action of the treatment. The use of
composite endpoints is controversial, however, when the different
endpoints defining the composite are more disparate. In this case,
interpretational difficulties may occur if the treatment effects on
different components of the composite go in different directions [44].
Additionally, if there is a substantial possibility that the treatment has
no effect on one of the endpoints being considered for inclusion in a
composite, then incorporation of that component may reduce the
magnitude of the projected treatment effect and thus reduce power
in spite of the increase in the total number of events.
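
The “nearly 50 %” figure can be illustrated with a minimal sketch. It assumes that the required number of events is the same for the single and composite endpoints (that is, the same relative treatment effect), that the number of patients needed is the required number of events divided by the probability of an event, and that the component events occur independently. The function name is hypothetical.

```python
def composite_reduction(event_probs):
    """Fractional reduction in required N from using a composite endpoint,
    relative to using the first listed event alone."""
    p_none = 1.0
    for p in event_probs:
        p_none *= 1.0 - p          # probability that no component event occurs
    p_composite = 1.0 - p_none     # probability of at least one event
    return 1.0 - event_probs[0] / p_composite


# Two event types, each projected to occur in 5 % of subjects.
print(f"{100 * composite_reduction([0.05, 0.05]):.1f} %")
```

If the treatment has little or no effect on one component, the assumption of an unchanged relative effect fails and the apparent saving can disappear, as noted above.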

The remaining rows of Table 3 represent scenarios in which
the study design can be modified to reduce the required sample
size. We note that in general, cluster randomization can be expected
to increase rather than decrease the required sample size due to
correlations of the outcome between subjects belonging to the
same cluster [45, 46]. However, an exception to this rule can occur
if the effects of a treatment intervention applied to some patients
within a given cluster (e.g., hospital, clinic, or physician) are
thought likely to spread to subjects in the control group within the
same cluster. For example, if the treatment intervention instructs
physicians to implement a new checklist to promote smoking
cessation, then the physicians may find it difficult to keep themselves
from applying components of the checklist to control
subjects as well. Such “contamination” may be prevented by
randomization at the cluster level, and if the risk of contamination is high,
the increase in effect size may outweigh the effect of correlated
outcomes within clusters to reduce the total required sample size.
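
This trade-off can be made concrete with a rough sketch. It assumes the standard design effect 1 + (m - 1) x ICC for clusters of equal size m with intracluster correlation ICC, and it assumes that contamination of a fraction c of control subjects dilutes the treatment effect by the factor (1 - c), so that the required N under individual randomization grows by 1/(1 - c)^2. Changes in outcome variance are ignored, and all numbers are hypothetical.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Inflation in required N from cluster randomization (equal cluster sizes)."""
    return 1.0 + (cluster_size - 1) * icc


def contamination_inflation(contaminated_fraction: float) -> float:
    """Inflation in required N under individual randomization when a fraction
    of control subjects effectively receives the intervention."""
    if not 0.0 <= contaminated_fraction < 1.0:
        raise ValueError("contaminated_fraction must be in [0, 1)")
    return 1.0 / (1.0 - contaminated_fraction) ** 2


# Hypothetical scenario: 10 patients per physician, ICC = 0.02,
# and 20 % of control patients exposed to the checklist.
print(design_effect(10, 0.02))        # inflation with cluster randomization
print(contamination_inflation(0.20))  # inflation with individual randomization
```

In this hypothetical scenario the cluster design requires the smaller inflation; with low contamination or a larger intracluster correlation, the comparison reverses.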


7 Multiple Outcomes 


While the clinical trial literature generally supports the use of a
single primary outcome on which to base sample size calculation,
there are instances where a clinical trial or cohort study may have
more than one primary outcome [47]. For example, multiple
distinct research questions may be deemed of equivalent importance,
or the investigators may stipulate that a beneficial effect must be
demonstrated on more than one outcome to demonstrate an
overall benefit to the patient. Sample size calculation for studies with
multiple primary outcomes can be complex. While some attempts
have been made to develop a single hypothesis test to
simultaneously evaluate the effect of a treatment on multiple outcomes [48],
the most common approach is to perform separate hypothesis tests
for each outcome individually. In this situation, it is necessary to
choose the sample size of the study based on the largest of the
minimum required sample sizes for each individual outcome. If
outcomes have different required sample sizes, this strategy assures
that each outcome achieves the desired minimum threshold for
power, with even greater power achieved for outcomes requiring
smaller sample sizes. Typically, studies including multiple primary
outcomes are required to use smaller α-levels for each individual
hypothesis test in order to preserve the overall type 1 error of the
trial. Such adjustment to the α-level can lead to a substantial
increase in sample size if more than 2 or 3 outcomes are included.
When power is 90 %, use of 2, 3, or 4 co-primary endpoints
increases the required sample size by approximately 18, 29, and
36 %, respectively, if a standard Bonferroni adjustment is used. The
increase in required sample size with multiple outcomes can be
slightly reduced if estimates exist for the correlations between the
different outcomes.
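
These percentages follow from the dependence of required sample size on the square of the sum of the two normal quantiles. The sketch below assumes two-sided tests, an overall α of 0.05, and the normal approximation; the function name is illustrative.

```python
from statistics import NormalDist


def bonferroni_inflation(n_outcomes: int, alpha: float = 0.05,
                         power: float = 0.90) -> float:
    """Ratio of required N with a Bonferroni-adjusted alpha to required N
    with the unadjusted alpha (two-sided tests, normal approximation)."""
    z = NormalDist().inv_cdf
    unadjusted = (z(1.0 - alpha / 2.0) + z(power)) ** 2
    adjusted = (z(1.0 - alpha / (2.0 * n_outcomes)) + z(power)) ** 2
    return adjusted / unadjusted


for k in (2, 3, 4):
    print(k, f"{100 * (bonferroni_inflation(k) - 1.0):.0f} %")
```

The calculation treats the outcomes as if each were tested in isolation; as noted above, known correlations between outcomes permit somewhat smaller adjustments.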


8 Power Calculations for Longitudinal Cohort Studies 

The concepts of the preceding sections concerning sample size
requirements of randomized clinical trials can also be applied to
observational cohort studies that estimate the effect of a treatment or
exposure on an outcome variable. However, in contrast to
randomized clinical trials, where randomization assures that the
treatment assignment is approximately unrelated to other baseline
factors, in cohort studies the exposure is often associated with
multiple confounding factors, which must be controlled for in the
study design or data analysis. The range of strategies for
controlling confounding is extensive, and a complete review is beyond the
scope of this chapter. Here, we consider the most common analytic
approach to controlling confounding: covariate adjustment using
multiple regression analysis.

For a continuous outcome variable, the formula for the total
sample size required for a comparison of the mean level of the
outcome between the exposed and unexposed after adjusting for a set
of covariates under multiple linear regression is given by the
following expression:

This expression differs from the sample size formula for a
two-group comparison of means in a randomized trial in three
important respects: (1) in contrast to a randomized trial, where
investigators usually allocate equal proportions of subjects to the
treatment and control groups, in cohort studies the proportion of
exposed reflects the prevalence of the exposure in the study
population, and may be anywhere between 0 and 1. The impact of the
prevalence of the exposure (denoted f) on the required sample size
is indicated by the first term, 1/[2f(1 - f)], which reduces to the
factor 2 in the corresponding formula in Table 1 if 50 % of subjects
are exposed, but can be substantially larger than 2 if f is close to 0
or 1; (2) the standard deviation of the outcome is replaced by the
adjusted standard deviation that remains after accounting for the
covariates included in the model. Typically the adjusted standard
deviation is difficult to estimate, so that an estimate of the standard
deviation of the outcome without covariate adjustment is often
used as a conservative approximation; (3) an additional
multiplying factor, 1/(1 - R²), called the variance inflation factor, is added
to the sample size formula to indicate the increase in the required
sample size that is needed to account for the squared multiple
correlation R² between the exposure and the covariates included in the
model. The greater the correlation of the exposure with the
covariates, the more difficult it is to separate the effect of the exposure
from the covariates, which is reflected in a higher variance inflation
factor. The same expression for the variance inflation factor can be
used to approximate the effect of inclusion of covariates in analyses
of binary or time-to-event outcomes [49-51].
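
The combined effect of exposure prevalence and the variance inflation factor can be explored numerically. The following is a minimal sketch under our own assumptions: a two-sided normal-approximation test, a total sample size derived from the variance of a difference in means, sd²/[N f (1 - f)], and multiplication by 1/(1 - R²). With f = 0.5 and R² = 0, this gives four times the core term in total, which corresponds to the familiar factor of 2 per group. The function and argument names are illustrative.

```python
import math
from statistics import NormalDist


def cohort_total_n(delta: float, sd_adj: float, f: float, r2: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate total N for an adjusted exposed vs. unexposed comparison.

    delta  : hypothesized adjusted difference in means
    sd_adj : adjusted standard deviation of the outcome
    f      : prevalence of the exposure (0 < f < 1)
    r2     : squared multiple correlation of exposure with the covariates
    """
    if not 0.0 < f < 1.0 or not 0.0 <= r2 < 1.0:
        raise ValueError("require 0 < f < 1 and 0 <= r2 < 1")
    z = NormalDist().inv_cdf
    core = (z(1.0 - alpha / 2.0) + z(power)) ** 2 * sd_adj ** 2 / delta ** 2
    vif = 1.0 / (1.0 - r2)  # variance inflation factor
    return math.ceil(core * vif / (f * (1.0 - f)))


print(cohort_total_n(0.5, 1.0, f=0.5, r2=0.0))  # balanced exposure, no correlation
print(cohort_total_n(0.5, 1.0, f=0.1, r2=0.0))  # rare exposure
print(cohort_total_n(0.5, 1.0, f=0.5, r2=0.5))  # exposure correlated with covariates
```

In this illustration a 10 % exposure prevalence nearly triples the requirement relative to a balanced exposure, and an R² of 0.5 doubles it.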


9 Other Issues and Conclusion 

This overview has omitted numerous important specialized topics
in sample size calculation. These include, but are not limited to,
(1) Bayesian sample size calculation, (2) sample size for
comparisons of variability, (3) adjustments to sample size to account for the
use of stopping rules, (4) sample size calculation in settings with
numerous hypothesis tests such as microarray studies,
(5) re-estimation of sample size under adaptive designs, and
(6) estimation of sample size under complex longitudinal designs. We refer
the reader to the text by Chow et al. [26] and references therein
for these topics.

Fundamentally, we have stressed the importance of the link
between sample size calculation and study design. This linkage
highlights the importance of close communication between
statisticians and biomedical investigators throughout all stages of the
study design process in order to allow for an iterative evaluation
of required sample size and alternative study design and analysis
strategies until a design for an adequately powered study emerges.


Chapter 14


Randomized Controlled Trials 6: On Contamination 
and Estimating the Actual Treatment Effect 

Patrick S. Parfrey 

Abstract 

The intention-to-treat analysis is the gold standard for evaluating efficacy in a randomized controlled trial.
However, when non-adherence to randomized treatments is high, the actual treatment effect may be
underestimated. The impact of drop-out from the intervention group or drop-in to the control group may
be controlled by trial design, increasing the sample size, effective study execution, and a prespecified
analytical plan to take contamination into account.

These analyses may include censoring at time of co-interventions associated with stopping treatment,
lag censoring which allows an additional period after discontinuation of study treatment to account for
residual treatment effects, inverse probability of censoring weights (IPCW), accelerated failure time
models, and contamination adjusted intent-to-treat analysis. These methods are particularly useful in assessing
the “prescribed efficacy” of the study treatment, which can aid clinical decision-making.

Key words Randomized controlled trials, Non-adherence, Drop-in, Drop-out, Censoring, Inverse
probability of censoring weights, Accelerated failure time models


1 Introduction 


The intention-to-treat (ITT) analysis is the gold standard for
evaluating the efficacy of a study treatment in randomized
controlled trials. However, when non-adherence to randomized
therapies is high, the actual treatment effect may be
underestimated. Contamination occurs as a result of drop-out from the
intervention group by patients who prematurely stop taking the
study treatment prior to experiencing a primary end point (treatment
non-adherence) or of drop-in, whereby patients in the placebo
group prematurely stop taking placebo and start the commercially
available treatment prior to experiencing an end point (treatment
crossover) [1]. As both these occurrences lead to the assumption
of risk similar to that in the opposite treatment group, they
diminish the power of the study to observe a treatment effect. The
potential impact of contamination needs to be taken into account
in the study design, sample size estimate, study execution, and
analysis. In particular, a prespecified analytic plan to assess the
impact of contamination will not only diminish the bias associated
with post hoc data-driven analyses but also answer the clinically
important question: What is the treatment effect size in patients who
actually receive the recommended intervention?


2 Controlling the Impact of Contamination 


2.1 Trial Design

Planning to limit and quantify treatment contamination is necessary.
In particular, consideration needs to be given to co-interventions that
occur during the trial and that may induce withdrawal from the treatment.
One solution for this is to stop following the patient at time of the
co-intervention. However, this action curtails the assessment of the
long-term impact of the study intervention on both study outcomes
and on safety. The optimal solution is to prespecify in the analytic
plan how to take account of co-interventions.

2.2 Sample Size Estimate

The sample size estimate should take account of both drop-in and
drop-out rates likely during the trial, with consequent increase in
the number to be enrolled. This requires an accurate prediction of
the rates of both types of non-adherence, which may be difficult.
For example, if the trial is of a commercially available treatment the
potential for drop-in exists, particularly if clinical practice
guidelines identify a clinically important role for the intervention. Even
though equipoise exists for the research question being answered
in the trial, some physicians enrolling patients may feel ethically
obligated to follow the clinical practice guideline, even if it is based
on inadequate information. The potential for drop-out exists if the
trial is for a long duration, particularly if it tests a novel compound
in a group of patients with substantial comorbidity. This drop-out
rate may be difficult to predict.

2.3 Study Execution

Unanticipated drop-in or drop-out may make a trial
uninterpretable. The sponsor of the trial and investigators are blinded to
whether a patient has been randomly allocated to the
intervention or control group, but careful monitoring is necessary to
determine the extent of cessation of investigational product and
the prescription of the commercial product. Surveys to assess
and optimize adherence, and use of intermediate measures of
adherence and effectiveness may limit treatment
contamination [1]. Unexpected increase in the rates of treatment
non-adherence or prescription of commercial treatment requires
immediate intervention in study centers to ensure that the study
protocol is being followed and that investigators maintain the
belief that equipoise exists regarding the research question.
Investment in time by monitors engaging with research nurses
and by the study principal investigators communicating with
local investigators is likely to enhance the quality of the trial
execution, and to determine whether safety issues are the cause
of non-adherence. Action on safety is of paramount importance.
This will require efficient aggregation of accurate data on all
patients enrolled, with subsequent assessment by the
independent Data Monitoring Committee, in a timely manner.

2.4 Statistical Plan

The ITT method comprises three principles:
(1) using all randomized patients regardless of whether they have
any follow-up data, (2) using randomized treatment assignment
regardless of what the patient actually received, and (3) including
all follow-up information in the analysis [2]. Based on these
principles, prognostic factors of the outcome should be balanced and
any differences in outcome that are observed can be attributed to
the treatment [3]. However, failure to assess the impact of
contamination may mask a beneficial treatment effect.

In the past, treatment contamination was addressed using “as
treated” and “per protocol” analyses. With the former technique
participants were analyzed entirely on the basis of treatment
received, and in the latter participants who failed to follow the
protocol were dropped from the study. However, these approaches
remove the benefits of randomization and result in non-random
omission bias, and have appropriately fallen out of favor [1].
Nevertheless, it is important to obtain the most accurate estimate of
efficacy in patients who actually receive the intervention.

Prespecified analyses to assess the impact of contamination on
the estimated treatment effect may include censoring at time of
co-interventions associated with stopping treatment, lag
censoring, inverse probability of censoring weights (IPCW), rank
preserving structural failure time model (RPSFTM), and iterative
parameter estimation (IPE). Recently contamination adjusted ITT
analysis has been proposed [1].


3 Statistical Methods 

3.1 Lag Censoring

Lag censoring analysis is a variation of naive censoring, in which data
are censored at a specific time point (e.g., at the time of
non-adherence to study treatment). The lag censoring method allows
an additional period after discontinuation of study treatment to
account for residual treatment effects. Although lag censoring
preserves randomization, and is simple to use and understand, it
violates an ITT principle in that it does not include all follow-up
information. Furthermore, it assumes that stopping study
treatment is random between the two treatment groups, but there may
be informative bias if non-adherent patients (compared to
adherent patients) have different prognostic characteristics predictive of
a primary event.
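
The censoring rule itself is simple to express in code. The sketch below is illustrative only: the function name, the argument names, and the 6-month lag are hypothetical choices, and a lag of zero reduces to naive censoring at the time of discontinuation.

```python
from typing import Optional, Tuple


def lag_censor(event_time: Optional[float], followup_end: float,
               stop_time: Optional[float], lag: float) -> Tuple[float, bool]:
    """Return (analysis time, event observed) under lag censoring.

    event_time : time of the primary event, or None if no event occurred
    stop_time  : time study treatment was discontinued, or None if never
    lag        : additional period counted after discontinuation
    """
    cutoff = followup_end
    if stop_time is not None:
        cutoff = min(followup_end, stop_time + lag)
    if event_time is not None and event_time <= cutoff:
        return event_time, True
    return cutoff, False


# Treatment stopped at month 10 with a hypothetical 6-month lag:
print(lag_censor(event_time=14, followup_end=36, stop_time=10, lag=6))  # event counted
print(lag_censor(event_time=20, followup_end=36, stop_time=10, lag=6))  # censored at 16
```

Events occurring after the lag window are discarded, which is precisely the departure from the ITT principle described above.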
3.2 Inverse Probability of Censoring Weights

With the IPCW method, data are censored at the time study treatment
is discontinued for non-adherent patients, and results from patients
with similar characteristics who remained on study treatment are
weighted more heavily [4]. This method models the causal
effect of treatment on outcomes, while accounting for
time-independent and time-varying confounders [5]. In this context,
confounders are variables that are affected by prior exposure to
treatment and predict subsequent exposure to study drug and the
outcome [6]. A key principle of the IPCW method is to
recreate the population that would have been observed had patients
remained on assigned study treatment. It does so by censoring
data at the time of study treatment discontinuation for
non-adherent patients and assigning weights that are proportional to
the inverse of the probability of remaining on study drug given
each individual patient’s characteristics.

To derive the weights, patients’ follow-up time until the time
of study treatment discontinuation is partitioned into several
intervals. The probability of remaining on study treatment at the end of
each interval, adjusted for baseline variables and time-varying
confounders, is estimated using a pooled logistic regression model. To
avoid possible extreme values when taking the inverse of these
probabilities, the weights are stabilized by multiplying by the
probability of remaining on study treatment, conditional only on
baseline variables.

The IPCW preserves randomization, takes into account
informative censoring, and adjusts for time-dependent confounders,
but it is sensitive to the number of non-adherent patients and
assumes that there are no unknown confounders. It is
computationally difficult to implement since it involves splitting the data
into appropriate time intervals, location of the dataset is difficult,
and parameter estimates may not be stable, as the model may not
converge. Nonetheless, it has been used in many large long-term
clinical trials and is accepted by many health care agencies.

3.3 Rank Preserving Structural Failure Time Model

This method is based on an accelerated failure time model, which
assumes that exposure to treatment has a multiplicative effect on a
patient’s survival time. The actual treatment effect can be
estimated using a causal model to relate the multiplicative effect and
the individual’s observed failure time to their counterfactual failure
time [7], the time that would have been observed if no treatment
was received. In the RPSFTM framework the counterfactual failure
time is a pre-randomization variable and is independent of
randomization. The treatment effect can be obtained from a grid
search over a range of plausible values, until the counterfactual
failure times are equally distributed between the treatment groups
using a test-based method (i.e., log rank) [8].
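
The logic of the grid search can be illustrated with a deliberately simplified sketch. It assumes the commonly used transformation U = T_off + exp(psi) x T_on, where T_on and T_off are the times spent on and off treatment and psi < 0 indicates that treatment prolongs survival. It ignores censoring entirely, and it replaces the log-rank test with a cruder balance criterion (the difference in mean log counterfactual time between arms). All names and data are hypothetical.

```python
import math


def counterfactual_time(t_off: float, t_on: float, psi: float) -> float:
    """Untreated (counterfactual) failure time U = T_off + exp(psi) * T_on."""
    return t_off + math.exp(psi) * t_on


def estimate_psi(arm_a, arm_b, grid):
    """Grid value of psi that best balances mean log counterfactual time.

    Each arm is a list of (t_off, t_on) pairs. This balance criterion is a
    stand-in for the log-rank test and is valid only without censoring.
    """
    def mean_log_u(arm, psi):
        total = sum(math.log(counterfactual_time(off, on, psi)) for off, on in arm)
        return total / len(arm)

    return min(grid, key=lambda psi: abs(mean_log_u(arm_a, psi) - mean_log_u(arm_b, psi)))


# Synthetic data generated with a true psi of -0.5.
true_psi = -0.5
untreated_times = [1.0, 2.0, 3.0, 4.0]
control = [(u, 0.0) for u in untreated_times]               # never treated
treated = [(0.0, u * math.exp(-true_psi)) for u in untreated_times[:3]]
treated.append((4.0 - math.exp(true_psi) * 2.0, 2.0))       # stopped treatment at time 2

grid = [i / 10 for i in range(-10, 1)]                      # -1.0, -0.9, ..., 0.0
print(estimate_psi(treated, control, grid))
```

A real analysis would use a rank-based test, handle administrative censoring through re-censoring, and obtain confidence limits from the range of psi values not rejected by the test.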
3.4 Iterative Parameter Estimation

This method is also based on an accelerated failure time model.
The principle is that the observed survival time of non-adherent
patients can be transformed to the survival time that could have
been observed had these patients remained on study treatment,
with non-adherent patients assuming the risk of an event of a
patient in the opposite treatment arm [9].

IPE models survival time as if drop-in patients never started
the commercially available treatment and drop-out patients
remained on the study intervention. The survival is contracted for
drop-in patients and expanded for drop-out patients. Rather than
using a test-based method, IPE uses a parametric likelihood
estimation method to derive treatment effect. Survival times are
transformed through an iterative process until the model
converges. IPE preserves randomization and there is no need to
model the pattern when patients drop in or drop out. It assumes
non-adherence is random, but is prone to informative bias. It
requires parametric modelling with the need to specify the correct
distribution. It is required to re-censor data when the transformed
survival time is beyond the study termination date. Computational
methods such as bootstrapping are required to obtain robust
confidence intervals.


3.5 Contamination Adjusted ITT

The instrumental variable technique was traditionally used in
non-randomized research studies. It used a variable associated with the
factor under study but not directly associated with the outcome
variable or any potential confounder [1]. The RCT is treated as an
instrumental variable with treatment assignment as the “instrument”.
The effect of treatment assignment on the outcome observed
(ITT analysis) is adjusted by the percentage of assigned
participants who ultimately receive the treatment (contamination
adjusted) [1]. In this way the effect of treatment receipt, rather
than treatment recommended, on the risk of the outcome can be
obtained. However, this methodology has not been well developed
for survival analysis and it is quite complicated to apply.
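
For a binary outcome analyzed as a risk difference, the adjustment can be sketched in a few lines. This is a minimal illustration under our own assumptions: the ITT risk difference is divided by the difference between arms in the proportion actually receiving treatment, treatment assignment is assumed to affect the outcome only through treatment receipt, and all numbers are hypothetical.

```python
def contamination_adjusted_effect(risk_assigned_treatment: float,
                                  risk_assigned_control: float,
                                  received_in_treatment_arm: float,
                                  received_in_control_arm: float) -> float:
    """Risk difference attributable to treatment receipt.

    The ITT risk difference is scaled by the difference between arms in the
    proportion of participants who actually received the treatment.
    """
    itt_effect = risk_assigned_treatment - risk_assigned_control
    uptake_difference = received_in_treatment_arm - received_in_control_arm
    if uptake_difference <= 0.0:
        raise ValueError("treatment uptake must be higher in the treatment arm")
    return itt_effect / uptake_difference


# Hypothetical trial: 22 % vs. 28 % event risk by assignment; 80 % of the
# treatment arm and 20 % of the control arm actually received treatment.
print(round(contamination_adjusted_effect(0.22, 0.28, 0.80, 0.20), 3))  # -0.1
```

In this hypothetical example, an ITT risk difference of 6 percentage points corresponds to a 10-point difference attributable to treatment receipt.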


4 An Example of Analysis of a Trial with Extensive Non-adherence 

EVOLVE was a randomized controlled trial in 3,883 patients on
hemodialysis with secondary hyperparathyroidism comparing
cinacalcet to placebo, during which extensive non-adherence occurred
[10, 11]. The data from this trial are used to examine the use of each
statistical method.

Cinacalcet (Sensipar®/Mimpara®, Amgen Inc.) is a calcimimetic
agent currently approved for the treatment of secondary
hyperparathyroidism in patients with chronic kidney disease
receiving dialysis. The EVOLVE trial was a global, multicenter,
placebo-controlled, double-blind, event-driven trial (N = 1,880) designed
to assess the risks and benefits of cinacalcet compared to placebo
along with conventional, standard-of-care therapies (including
phosphate binders and vitamin D sterols in the majority of patients)
on a composite end point consisting of all-cause mortality and
major cardiovascular events (myocardial infarction, hospitalization
for unstable angina, heart failure, or peripheral vascular event) [11].
Patients were randomized 1:1 to cinacalcet or placebo, stratified by
history of diabetes (yes/no) and country. At the time of
enrollment, cinacalcet was commercially available in 18 out of the 22
(82 %) countries participating in the study. Although the use of
commercial cinacalcet was discouraged, physicians had the option
to prescribe commercially available study drug if deemed clinically
important.

The trial was originally anticipated to last 4 years, but due to
the lower-than-expected pooled event rate, it was extended to
5.5 years. During the course of the study, a large proportion of
patients prematurely withdrew from treatment. Of the 1,935 patients
randomized to placebo, 1,365 (71 %) discontinued study drug, and
1,300 (67 %) of the 1,948 patients randomized to cinacalcet also
discontinued study drug (Table 1) [10].

Discontinuation for protocol-specified reasons was similar in both
intervention and placebo groups, but discontinuation for
non-protocol-specified reasons was higher in the placebo group
(Fig. 1). These rates are 2-3 times higher than in other large,
long-term cardiovascular outcomes studies of comparable sample size,
in which study drug discontinuation rates ranged between 20 and
30 % [12]. The time on treatment was approximately half of the
total time patients were in follow-up for end points in both
treatment groups; the median time on treatment was 17.5 months in
the placebo group compared to 21.2 months in the cinacalcet
group. In addition, a substantial proportion of patients also
received commercially available cinacalcet during the trial (11 % in
the group randomized to cinacalcet and 23 % in the group
randomized to placebo). Moreover, 14 % of patients randomized to
placebo and 7 % of patients randomized to cinacalcet underwent
parathyroidectomy, a surgical and more definitive approach to
managing hyperparathyroidism. As reported previously [10], 384
(19.8 %) patients randomized to placebo received commercially
available cinacalcet prior to the occurrence of a primary event,
corresponding to an annual rate of 7.4 %. Similarly, 1,207 (62 %) of
patients randomized to cinacalcet discontinued study drug prior to
the occurrence of a primary event (corresponding to an annual rate
of 27.3 %), effectively resulting in crossover between study arms.

For the primary analysis, Kaplan-Meier product limit estimates
of event-free survival were compared between the two groups using a
two-sided, stratified log-rank test. The primary end point did not
achieve statistical significance (P-value = 0.112) [10]. The
relative hazard for cardiovascular events or death was 0.93 (95 %
confidence interval (CI): 0.85-1.02) for patients randomized to
cinacalcet compared to placebo [10].


Table 1
Reasons for discontinuing study drug in EVOLVE [10]

                                                    Cinacalcet    Placebo
                                                    (N = 1,948)   (N = 1,935)

Subjects who discontinued study drug (%)               66.7          70.5
  Ineligibility determined                              0.1           0.3
  Consent withdrawn                                     1.8           2.2
  Lost to follow-up                                     0.6           0.6
  Adverse event                                        15.8          11.8
  Protocol-specified reasons                           22.1          20.2
    Parathyroidectomy                                   2.4           7.6
    Kidney transplant                                  13.3          11.9
    Calcium <7.5 mg/dL or symptoms of hypocalcemia      1.1           0.1
    Low PTH                                             5.2           0.4
    Pregnancy                                           0.0           0.1
  Administrative decisions/subject request             20.6          30.7
    Hyperparathyroidism                                 1.9           6.5
    Commercial cinacalcet                               0.4           1.6
    Adverse event                                       2.3           1.2
    Noncompliance                                       3.5           3.3
    Other administrative decision/subject request      12.9          19.7
    Commercial cinacalcet                               1.2           5.6
  Other reasons                                         5.4           4.5
  Missing reason                                        0.2           0.2
Never received study drug                               0.5           0.6

N = Number of randomized patients. Percentages are based on N

Although the ITT analysis is a valid test to compare two
treatment policies, it does not provide the best estimate of the
actual effect of the study drug when there is considerable
non-adherence [1, 3]. No other therapies have effectively reduced
the burden of cardiovascular disease or mortality in the
hemodialysis population; thus, detailed assessment of the estimated
treatment effect in EVOLVE should be particularly relevant to
clinical decision-making.


Fig. 1 (a) Time to discontinuation of study drug in the EVOLVE trial
for protocol-specified reasons [10]; (b) time to discontinuation of
study drug for non-protocol-specified reasons [10]. From: Effect of
cinacalcet on cardiovascular disease in patients undergoing
dialysis. The EVOLVE Trial Investigators. N Engl J Med. 367:2488.
Copyright 2012 Massachusetts Medical Society. Reprinted with permission


Table 2
Relative Hazard (Cinacalcet vs Placebo) in EVOLVE using different
statistical methods [10]

Analysis                                             HR (95 % CI)

Unadjusted intention-to-treat                        0.93 (0.85, 1.02)
Censoring at renal transplant, parathyroidectomy,
  on commercial cinacalcet (a)                       0.84 (0.76, 0.93)
Lag censoring 6 months after drug stop               0.85 (0.76, 0.95)
Inverse probability censoring weight (b)             0.81 (0.70, 0.92)

(a) Co-interventions that reduce serum PTH levels and lead to
withdrawal of study cinacalcet
(b) Odds ratio and 95 % confidence intervals (CI) from the final
pooled logistic regression model


4.1 Censoring at Time of Co-interventions

As the co-interventions kidney transplantation, parathyroidectomy,
and use of commercial cinacalcet reduced parathyroid hormone
levels and were associated with cessation of study cinacalcet, data
were censored after each of the co-interventions. For each of these
co-interventions the relative hazard for the primary end point was
0.90 (95 % CI = 0.82-0.99; P<0.03) [10]. Censoring at the time
of any of these three events yielded a relative hazard of 0.84 (95 %
CI = 0.76-0.93; P<0.001) (Table 2).


4.2 Lag Censoring

A priori, 6 months post-study drug discontinuation was selected as
the lag period, as we hypothesized a persistent effect of cinacalcet
on the progression of cardiovascular disease related to on-treatment
parathyroid hormone lowering effects. Study drug discontinuation
was defined as the time point at which a subject discontinued study
drug permanently for any reported reason or beyond which there
was a gap of at least 6 months in the subject’s study drug
administration, whichever was earlier. Using this approach,
follow-up time and events accrued beyond 6 months following study
drug discontinuation were not included in the analysis. Using the
lag censoring method, all randomized patients were included and
their respective randomized assignments were preserved in the
analysis [10]. The relative hazard was 0.85 (95 % CI: 0.76, 0.95)
(Table 2).


4.3 Inverse Probability of Censoring Weights

Data were split into time intervals starting from randomization
and up until patients had an event, discontinued study drug, or
completed the study, whichever occurred first. The probability of
continuing to receive study drug at the end of each time interval
was derived using two pooled logistic regression models.
A weight was assigned to each patient based on the inverse of the
predicted probability of continuing to receive study drug, based on
baseline and time-dependent covariates, at the end of each interval.
To create more stabilized weights, the weight was multiplied by the
predicted probability of continuing to receive study drug based only
on baseline covariates. Therefore, heavier weights were assigned to
patients who did not drop out but had similar characteristics to
those who did. A final cumulative weight for each patient was
calculated by multiplying the weights from each time interval. The
adjusted treatment effect was estimated using a weighted Cox
regression model where data for patients who discontinued study drug
were censored at the time of discontinuation. Using the IPCW method,
the relative hazard for the primary composite end point was 0.77
(95 % CI 0.66-0.88) [10].
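
The weighting steps described above can be sketched for a single
patient as follows. This is a minimal illustration, not the EVOLVE
analysis code: it assumes the per-interval probabilities of remaining
on study drug have already been predicted by the two pooled logistic
models, and the function name and example probabilities are
hypothetical.

```python
def stabilized_cumulative_weights(p_full_model, p_baseline_model):
    """Cumulative stabilized IPCW weights for one patient.

    p_full_model:     per-interval probability of staying on study drug,
                      predicted from baseline and time-dependent covariates.
    p_baseline_model: the same probability predicted from baseline
                      covariates only (used to stabilize the weight).
    Returns the cumulative weight at the end of each interval.
    """
    if len(p_full_model) != len(p_baseline_model):
        raise ValueError("both models need one prediction per interval")
    cumulative = 1.0
    weights = []
    for p_full, p_base in zip(p_full_model, p_baseline_model):
        if not (0.0 < p_full <= 1.0 and 0.0 < p_base <= 1.0):
            raise ValueError("probabilities must lie in (0, 1]")
        # Interval weight: inverse probability, multiplied by the
        # baseline-only probability to stabilize it.
        cumulative *= p_base / p_full
        weights.append(cumulative)
    return weights


# Hypothetical patient whose time-dependent covariates make staying on
# drug increasingly unlikely, so later intervals carry more weight.
print(stabilized_cumulative_weights([0.9, 0.8, 0.5], [0.9, 0.85, 0.8]))
```

The resulting per-interval weights would then be supplied to a
weighted Cox model; that fitting step is omitted here.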

4.4 Interpretation

The sensitivity analyses performed in the EVOLVE trial, which adjust
for treatment contamination, suggest that the true effect size in
patients who actually received cinacalcet is larger than the effect
size in patients recommended for cinacalcet (i.e., randomized to
cinacalcet). It seems reasonable to adjust for co-interventions that
lower parathyroid levels and induce cessation of the drug. The
IPCW method is attractive because it is not prone to informative
bias while adjusting for contamination. Although lag censoring is
prone to informative bias, the hazard ratio is similar to that
obtained by censoring at co-interventions or with the IPCW method.


5 Conclusion 


While the ITT method remains the gold standard to establish
efficacy of a study treatment, additional analyses should be
considered to assess the impact of contamination on the treatment
effect estimate derived from the ITT analysis. Such analyses are
particularly useful in assessing the “prescribed efficacy” of the
study treatment, which can aid clinical decision-making.


Chapter 15


Randomized Controlled Trials 7: Analysis 
and Interpretation of Quality-of-Life Scores 

Robert N. Foley and Patrick S. Parfrey 

Abstract 

Quality-of-life (QoL) outcomes are important elements of randomized controlled trials. The instruments 
for measurement of QoL vary but usually multiple comparisons are possible, a concern that can be offset 
by prespecifying the outcomes of interest. Missing data may threaten the validity of QoL assessments in 
trials. Therefore familiarity with the strategies used to account for missing data is necessary. Measures that 
incorporate both survival and QoL are helpful for treatment decisions. The definition of minimal clinically 
important differences in QoL scores is important and often derived using inadequate methods. 

Key words Quality of life, Assessment, Measurement scales, Patient-reported outcome, Missing data, 
Quality-adjusted survival, Minimal clinically important difference 


1 Introduction 


Quality-of-life (QoL) outcomes are important elements of most
pivotal randomized controlled trials. Even in trials where QoL has
secondary importance, some familiarity with the analytical
challenges posed by QoL outcomes is important for overall
interpretation of trial findings. At their core, QoL outcomes share
the same challenges as other measurements that are made
longitudinally in patients in trials. As with studies examining
parameters like lipid levels, glycemic control, body mass index, or
blood pressure, one can anticipate that QoL studies are under
constant threat from missing data, and one needs to understand
whether the overarching need is to measure time-integrated (or area
under curve [AUC]) differences between treatments or to capture
differences at a specific point in time. Treatment allocation is
intuitively important for patient-reported outcomes like QoL; even
where initial treatment allocation is concealed, it is worth
pondering how feasible long-term concealment is likely to be. While
multiplicity of comparisons is not unique to QoL studies, it is
often more of a concern, because many QoL studies have multiple
instruments, domains, and large numbers of individual items. Some
QoL instruments, especially those that capture patient preferences
and utilities into single scores, are useful for integrating cost
and anticipated survival, allowing direct comparisons across
different disease states and treatments. Finally, perhaps the most
difficult issue that arises with QoL studies concerns the
translation of subjective, patient-reported outcomes into objective
metrics. Although easily applicable, valid estimates are rarely
available, the impact of a given QoL difference in a randomized
trial would be greatly enhanced if readers knew what constituted a
clinically significant increment in QoL for the instrument under
consideration.


2 QoL Instruments 


In this chapter, we assume that readers have some familiarity with
basic statistical techniques, especially regarding longitudinal
comparisons of outcomes that are typically interval in nature. This
preamble presents some basic steps that an interested but
nonspecialist reader might consider when presented with
quality-of-life data in the setting of a large randomized controlled
trial with major clinical events as the primary outcome. A
fundamental principle of patient-reported quality-of-life
assessments is that they should come from the patient.

Lack of familiarity with QoL scales and scoring may partly
explain a tendency for some health care professionals to consider
QoL as lacking in scientific validity and clinical usefulness. Many
scales are very simple, and the majority are based on linear
templates like rating health status on a line varying from 0 (the
worst possible) to 10 (the best possible), or categorical templates
with descriptions like “not at all,” “a little,” “quite a lot,” and
“very much.” When scales contain multiple items, these are usually
totaled across all variables, but it is worth checking that this is
a feature of the instrument under consideration. In many schemes,
the working score is standardized to a range of 0-100, a summary
measure that is often called the “scale score.” Standardizing scales
in this manner can allow readers to discern dominant QoL effects
in controlled trials. Where expected scores in the general
population are known, scores can be reported in terms of population
norms, usually based on age and gender.
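
The standardization to a 0-100 range can be illustrated with a short
sketch. It assumes the common linear transformation in which summed
item responses are rescaled between the lowest and highest possible
totals; individual instruments may define their scoring differently,
and the function name and item values here are hypothetical.

```python
def scale_score(item_responses, item_min, item_max):
    """Rescale a summed multi-item score to the range 0-100.

    Assumes every item shares the same response range
    (item_min..item_max) and that items are simply totaled.
    """
    n_items = len(item_responses)
    if n_items == 0:
        raise ValueError("at least one item response is required")
    raw_total = sum(item_responses)
    lowest_total = item_min * n_items
    highest_total = item_max * n_items
    return 100 * (raw_total - lowest_total) / (highest_total - lowest_total)


# Hypothetical four-item scale with each item rated 1-5.
print(scale_score([3, 4, 2, 5], item_min=1, item_max=5))
```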

Generic instruments, designed to be applicable in a wide range of
conditions, have the advantage of allowing comparisons across
different health states. In more specific disease states, generic
instruments may not be able to capture the health issues of
paramount concern to patients with that disease and may not be
responsive to disease-specific treatments. This has led to the
development of disease-specific questionnaires. Thus, a common
approach in large clinical trials is to use both a generic and a
disease-specific instrument.

Given that it is the most widely used QoL instrument in
clinical trials, all health care professionals should have a
familiarity with the Medical Outcomes Study 36-Item Short Form
(SF-36) [1]. The SF-36 is a measure of general health status, which
was developed to fill a gap between much lengthier time-consuming
questionnaires and single-item instruments. It was designed to be
applicable to all ages, all diseases, and in healthy populations.
Capturing information in three main areas (physical, social, and
emotional functioning), it can be self-assessed or administered by
trained interviewers. It has been validated in a wide variety of
health states, demographic subgroups, cultures, and languages. In
terms of overall structure, 36 questions addressing eight health
concepts produce summary measures for both physical and mental
health. Physical health is divided into scales for physical
functioning (ten items), role-physical (four items), bodily pain
(two items), and general health (five items). Mental health
encompasses scales for vitality (four items), social functioning
(two items), role-emotional (three items), and mental health (five
items). In addition, there is a question about the trajectory of
general health, as follows: “Compared to 1 year ago, how would you
rate your general health now?” There is also a global question about
overall health: “In general, would you say your health is:
(excellent, very good, good, fair, poor)?” Regarding time frames,
most questions refer to the past 4 weeks, although some relate to
the present.

A fundamental qualitative question that needs to be considered
is whether the instrument makes intuitive sense in the context
of the clinical trial in which it is being used. The main aims of the
instrument should be clear, there should be a rational basis for the
dimensions of the instrument, and intended usage criteria should
be well defined. The procedures used to develop and validate the
questionnaire should pass muster. There should be documented
information that the instrument is suitable for the target
population. Given that missing data is an issue that threatens the
validity of many QoL trials, it is useful to have an idea of
associated ease of administration and the time required for
completion.

2.1 QoL Scoring

As scoring systems for QoL instruments vary widely, a concise,
accurate description of the scoring procedure should be easily
available. It is important to know whether a global QoL score
exists within the instrument and whether multiple scales can be
combined to derive a global score for overall QoL. If available,
guidelines for clinical interpretation of absolute values and changes
in scale scores are very helpful. Many of the more commonly used
instruments have multiple scales and it is important to know when
and how to group component scales. Some scores have been tested
in the general population to produce normative data, usually
stratified by age and gender. Thus, scores from these instruments
can be reported in terms of expected values from the general
population by a number of different methods, including percentile
scores, Z-scores, and T-scores. Z-scores are defined as the number
of standard deviations from the mean of the reference population;
T-scores are quite similar to Z-scores but have a mean of 50 (vs. 0
with Z-scores) and standard deviation of 10 (vs. 1 with Z-scores) in
the general population.
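
The two norm-based scores defined above differ only by a linear
rescaling, as the following minimal sketch shows. The reference mean
and standard deviation are hypothetical values chosen for
illustration, and the function names are not taken from any scoring
manual.

```python
def z_score(score, population_mean, population_sd):
    """Number of standard deviations from the reference population mean."""
    if population_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (score - population_mean) / population_sd


def rescaled_norm_score(score, population_mean, population_sd):
    """Norm-based score with mean 50 and standard deviation 10."""
    return 50 + 10 * z_score(score, population_mean, population_sd)


# Hypothetical reference population: mean 50, standard deviation 20.
print(z_score(40, 50, 20))
print(rescaled_norm_score(40, 50, 20))
```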


3 Analysis of Treatment Effect 

Even where QoL is an ancillary outcome in a clinical trial, it is
useful to know whether a single variable within a specific time
frame was designated as being of principal interest during the
planning phase of the trial. As with other outcomes in clinical
trials, it is useful to know whether statistical power was
considered before the trial and whether the likelihood of dropouts
occurring was incorporated in the design. As many QoL instruments
have a large number of subsidiary scores and are measured at
multiple time points, this approach can mitigate the risk of undue
attention, after the fact, on a single test result showing a
between-treatment difference at a single time point. Thus, it may be
worth checking whether treatment-related P-values are adjusted for
multiple comparisons. Given that the intent of randomization is to
generate groups with similar clinical and demographic
characteristics, it is important to check whether randomization was
actually successful. When imbalances are found, it is critical that
QoL analyses are presented with and without adjustment for known
differences at the time of randomization. For trials showing no
differences between treatments, it is very useful to know the
statistical power that was actually available when the trial
concluded.
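
As an illustration of what adjusting P-values for multiple
comparisons can look like, the sketch below applies the Holm
step-down procedure. The text does not name a specific method, so
the choice of Holm is an assumption made for this example, and the
P-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted P-values in the original order.

    The smallest P-value is multiplied by m, the next by m - 1, and so
    on; adjusted values are capped at 1 and forced to be non-decreasing
    in rank order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical P-values for three QoL subscales.
print(holm_adjust([0.01, 0.04, 0.03]))
```

In this hypothetical case, two subscales that were nominally below
0.05 no longer are once the three comparisons are accounted for.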

Treatment allocation is a critical consideration in trials with
QoL outcomes. Even when placebo treatments are used throughout,
patients and treatment teams can determine probable treatment in
some situations. For example, in a trial where anemic patients are
randomly assigned to different hemoglobin targets, it may not be
ethical, or feasible, to prevent hemoglobin levels being measured
outside of the trial setting, especially when trials are of long
duration, patients receive multiple types of specialist clinical
care, and patients are likely to be hospitalized during the course
of the trial. If this situation applies, the likelihood of successful
treatment concealment would be expected to decline over time.
Although rarely seen in trial reports, it would be very useful, in
trials predicated on targeted laboratory variables, to know the
comparative frequency of nonscheduled, nonconcealed measurements of
these laboratory variables.

While several approaches to dealing with missing data are
discussed later on, it is important to get a numerical sense of
overall compliance and whether this differs between treatment arms.
Compliance is calculated as a proportion, with the number of
completed QoL questionnaires as numerator, and the number of
expected questionnaires as denominator, bearing in mind that
QoL forms can only be expected for patients still alive at the time
point in question. A related graphical approach is to plot, as in a
survival analysis, the proportion of patients who have not yet
missed a scheduled QoL element against time.
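
The compliance calculation described above can be sketched as
follows. The record layout (one pair of flags per patient at a given
time point) and the function name are hypothetical, chosen only to
make the numerator and denominator explicit.

```python
def qol_compliance(records):
    """Compliance at one time point.

    records: iterable of (alive, completed) boolean pairs, one per
    randomized patient. Questionnaires are expected only from patients
    who are still alive at the time point in question.
    """
    expected = sum(1 for alive, _ in records if alive)
    if expected == 0:
        raise ValueError("no questionnaires were expected")
    completed = sum(1 for alive, done in records if alive and done)
    return completed / expected


# Hypothetical arm of five patients: one has died, three of the four
# survivors returned the questionnaire.
arm = [(True, True), (True, True), (True, False), (False, False), (True, True)]
print(qol_compliance(arm))
```

Computing this separately for each treatment arm and time point shows
whether compliance differs between arms.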

Most trials measure QoL longitudinally at predetermined,
regular intervals. It is rarely feasible, for logistical reasons, to
perform each QoL assessment at exactly the desired interval, and
acceptable time windows are usually employed. While the need for
time windows is understandable, readers of trial results need to
examine the procedures employed and use their own judgment as
to their appropriateness. Ever longer windows can threaten
validity, and as elsewhere, it is worth checking whether decisions
about window length were planned before the trial began, whether
visits occurring outside windows were as likely to be early as late,
and whether attendance patterns were the same in all treatment arms
of the trial.

While this chapter is not intended to cover statistical
techniques in detail, it is worth checking whether the main
statistical analysis tool employed is what one would expect for the
primary QoL outcome being assessed in the trial. Thus categorical,
ordinal, and interval outcomes should use the appropriate type of
statistical test, and distributional requirements of the test should
be respected. Many trials use an area-under-the-curve approach for
primary outcomes, and the statistical tools should reflect this.
Imbalances of covariates at baseline should prompt the reader to
look for analyses that adjust for these imbalances. Decisions to
form categories from variables that are intrinsically continuous
should have a sound underlying rationale, and decisions made after
the fact should be viewed with a degree of skepticism. As journals
now have the facility to publish large amounts of additional
information as online appendices, word and page limits can hardly
justify the lack of availability of comprehensive descriptions of
study procedures and results.


4 Missing Data 


Missing data commonly threaten the validity of quality-of-life
assessment in trials. It is essential to develop a sense of whether the
available data are representative of the QoL of the combined group
of patients with and without missing data. In the planning phase
of trials examining QoL, it is useful to develop a simple system
for describing the causes of failure to capture these data elements.
For readers of these trials, tracking the numbers of tests over time
and comparisons between treatment arms are essential components
of the critical appraisal of these trials. Similarly, where missing data
are substantial, a detailed comparison of the baseline characteristics
of subjects who do or do not complete all of the scheduled
assessments should be available.

Some familiarity with the terminology and strategies used in
different missing data scenarios can be useful. It is easy to
envisage several situations where the fact of missing data could be
informative. For example, subjects may not be able to complete
the quality-of-life assessment because of advancing illness or may
not appear for testing because symptoms of the illness have
abated. Although rarely performed, it is possible to get a
numerical estimate of the relationship between failing to complete a
QoL assessment and a hard outcome, like death. For example, in
a trial where death is the primary outcome and QoL is an
ancillary outcome, it would be instructive to treat the time elapsed
between baseline and failure to complete a scheduled QoL test as
a time-dependent covariate in a statistical model where time to
death is the primary outcome.

A simple approach to dealing with missing data is to consider
only subjects with complete information. This is a standard
technique, often used for outcomes like blood pressure, body mass
index, lipid levels, and glycemic control. Unlike these parameters,
many widely used QoL instruments have multiple individual items,
and it is worth noting what exactly constitutes a missing case.
For example, if one of 36 items is missing, it is not possible to use
actual data for that scale, and a global score is not available. Scores
based on the remaining 35 items are available, however, and may
provide useful insights. Whether employed or not, it is worth
pointing out that exclusive reliance on complete-case analysis is the
equivalent of assuming that data are missing completely at
random. In practice, it is common to add alternative strategies
when the totality of missing data exceeds predefined proportions
(often 5 % in large trials) or when proportions differ by
randomized treatment allocation.

Available-case approaches have obvious disadvantages. For
example, when many assessments are made, it becomes extremely
likely that some, or all, of a single assessment will be missing in
most patients. Analyzing available data separately at different time
points is an intuitive approach to this problem, although it typically
results in different numbers of subjects being available at different
time points in the trial. Summary measures, like AUC or greatest
change in QoL, are another intuitive, commonly used approach,
but deserve scrutiny as it can appear that all patients were
compared, even though the extent of missing data is large
enough to imperil the conclusions. For example, even if ten
postrandomization assessments were planned, only one is required
to calculate both an AUC and a largest QoL increment.
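
The point about summary measures masking missing data can be made
concrete with a small sketch. It assumes a trapezoidal AUC computed
from whatever assessments are available and returns the number of
assessments actually used alongside the area; the function name,
visit schedule, and scores are hypothetical.

```python
def qol_auc(times, scores):
    """Trapezoidal AUC over the available (non-missing) assessments.

    Missing assessments are given as None. Returns the area together
    with the number of assessments that contributed to it, so that the
    extent of missing data stays visible.
    """
    available = [(t, s) for t, s in zip(times, scores) if s is not None]
    if len(available) < 2:
        raise ValueError("at least two assessments are needed for an AUC")
    area = 0.0
    for (t0, s0), (t1, s1) in zip(available, available[1:]):
        area += (s0 + s1) / 2 * (t1 - t0)
    return area, len(available)


# Hypothetical patient: baseline plus a single follow-up at month 12;
# the three intermediate scheduled assessments are missing.
months = [0, 3, 6, 9, 12]
scores = [60, None, None, None, 70]
print(qol_auc(months, scores))
```

An AUC is still produced here even though most scheduled assessments
are missing, which is why the count of contributing assessments is
worth reporting next to the summary measure.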

A frequently employed approach that attempts to provide a
middle ground between complete-case and available-case approaches
is to replace missing data with values that are imputed based on an
arbitrary set of rules. For example, many investigators elect to carry
forward the last available measurement. As with many other facets
of randomized trials, it is worth assessing whether the algorithm
used for imputation was formulated in the planning phase of the
trial. Regardless of whether or not an a priori approach was used,
no amount of imputation can eliminate the threat of bias generated
by large degrees of missing data. While not often reported, it is
often very useful to quantify the total number of complete forms
that are missing as well as the number of individual items missing
within the form. A detailed exposition of different imputation
techniques is available in large QoL textbooks, but the basic
dichotomy is the use of existing data from the subject with a
missing data element, as opposed to using data from other
patients [2].
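
Carrying forward the last available measurement, the example
mentioned above, can be sketched in a few lines. The function name
and scores are hypothetical, and the sketch deliberately leaves
leading missing values unfilled because there is nothing to carry
forward.

```python
def carry_last_observation_forward(scores):
    """Replace each missing score (None) with the most recent observed one.

    Values missing before the first observed score remain None.
    """
    filled = []
    last_observed = None
    for score in scores:
        if score is not None:
            last_observed = score
        filled.append(last_observed)
    return filled


# Hypothetical series of six scheduled assessments for one patient.
print(carry_last_observation_forward([70, 65, None, None, 60, None]))
```

This rule uses only the subject's own existing data; it implicitly
assumes QoL stayed constant after the last completed assessment.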


5 Quality-Adjusted Survival 

Changes in survival and QoL may not occur in parallel, and
measures that jointly incorporate both elements can be very useful
for treatment decisions, both at the level of individual patients and
in terms of cost to society as a whole. In most quality-adjusted
survival models, health ratings can vary between 0 (the worst
imaginable) and 1 (the best imaginable). All of these ratings are
patient preferences, but the term “utilities” is usually used when
the outcome being assessed is uncertain. Many procedures have been
employed to measure these utilities. With visual analog scales, for
example, study subjects are asked to rate their current health status
on a line bounded by 0 (worst) and 1 (perfect). Standard gamble
techniques produce utilities by asking questions like “If there is X
percent chance of death in the next year, but you have to take this
therapy for the rest of your life, would you take the treatment?”,
varying X and arriving at a point of indifference. Time trade-off
techniques are quite similar, but the point of equipoise is achieved
by asking questions like “Would you pick X months in perfect
health or 1 year at your current health?” Another variant,
Willingness to Pay, asks similar questions that are predicated on the
maximum monetary price subjects would be prepared to pay to
avoid adverse health states.
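
For the time trade-off question quoted above, the utility is
conventionally taken as the ratio of the time in perfect health to
the time in current health at the point of indifference. The sketch
below assumes that convention, which the text does not spell out;
the function name and the 9-month answer are hypothetical.

```python
def time_trade_off_utility(months_perfect_health, months_current_health=12):
    """Utility implied by a time trade-off point of indifference.

    Assumes the respondent is indifferent between
    months_perfect_health in perfect health and
    months_current_health in the current health state.
    """
    if not 0 <= months_perfect_health <= months_current_health:
        raise ValueError("time in perfect health must lie between 0 and "
                         "the time in current health")
    return months_perfect_health / months_current_health


# Hypothetical respondent indifferent between 9 months in perfect
# health and 1 year in current health.
print(time_trade_off_utility(9))
```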

Assuming a valid utility measure has been obtained, varying
between 0 and 1, quality-adjusted life years are easily calculated.
Consider a progressive disease with multiple known states; in the
first state, the survival is $S_1$ years and the utility of that state
is $U_1$, while in the final state, the survival is $S_f$ years and
the utility is $U_f$. The overall number of quality-adjusted life
years (QALY) is calculated as:
$\mathrm{QALY} = U_1 S_1 + U_2 S_2 + \dots + U_n S_n + \dots + U_f S_f$.
When treatment costs are known, the cost-utility per QALY uses cost
as numerator and QALY as denominator:
$\text{Cost-Utility per QALY} = \mathrm{Cost}/\mathrm{QALY}$.
In intervention trials of control (C) and experimental treatments
(E), the cost per QALY gained with the experimental treatment
is calculated as:
$\text{Cost per QALY Gained}_{E\ \mathrm{vs}\ C} = (\mathrm{Cost}_E - \mathrm{Cost}_C)/(\mathrm{QALY}_E - \mathrm{QALY}_C)$.
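
The calculations above can be worked through in a short sketch. The
health states, utilities, survival times, and costs are hypothetical
and are chosen only to make the arithmetic easy to follow.

```python
def qaly(states):
    """Sum of utility x survival (years) over successive health states.

    states: iterable of (utility, years) pairs, with utility in [0, 1].
    """
    total = 0.0
    for utility, years in states:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utilities must lie between 0 and 1")
        total += utility * years
    return total


def cost_per_qaly_gained(cost_e, qaly_e, cost_c, qaly_c):
    """Incremental cost of the experimental arm per QALY gained."""
    gain = qaly_e - qaly_c
    if gain <= 0:
        raise ValueError("experimental treatment gains no QALYs")
    return (cost_e - cost_c) / gain


# Hypothetical two-state disease: (utility, years spent in state).
control = qaly([(0.75, 2), (0.5, 1)])        # 2.0 QALYs
experimental = qaly([(0.75, 4), (0.5, 1)])   # 3.5 QALYs
print(control, experimental)
print(cost_per_qaly_gained(40_000, experimental, 10_000, control))
```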

Many variants of the QALY approach can be defined. For
example, consider a disease whose treatments have large effects on
QoL but are followed by treatment-free periods of good health,
and possibly, by a relapse to the original disease state. If the
proportions of overall survival spent in each state are known, it is
possible to quantify quality-adjusted time without symptoms and
toxicity (Q-TWIST). Healthy years equivalents (HYE) are another
variant, where study subjects report the number of years in full
health they would trade for current health states.


6 Clinical Interpretation and Clinically Important Differences 

While many patients struggle with interpreting how meaningful a
finite change in blood pressure might be, many health care
professionals are quickly able to decide whether such a change is of
minimal, moderate, or great importance. Although it is very easy to
see that QoL measures could be fundamentally important to patients
and changes in QoL have survival implications in many studies,
many health care professionals discount real differences in QoL,
because they do not understand their clinical meaning. Lack of
proof of clinical importance applies equally to many other
measures, like changes in blood pressure, body mass index, lipid, or
blood glucose levels, not least because proof is difficult to
ascertain. It may well be the case that lack of familiarity with QoL
scales contributes to the difficulties health professionals have
with interpreting clinical importance.

Just like trying to estimate the value of a finite change in blood 
pressure in a randomized trial, determining the clinical significance 
of a change in a QoL instrument is a challenging proposition. The 
most obvious approach is to actually ask the patients, as in asking 
them to rate the importance of the change in overall status, 
perhaps with a multiple-category Likert scale or with a linear visual 
analog scale. This approach, often called the anchor-based method, 
has the critical advantage of actually respecting the core 
philosophy of quality-of-life assessment by relying on what patients say, as 
opposed to relying on what health care professionals think [3]. 
This approach, however, has the disadvantage of requiring another 
scale to evaluate changes in the scale being evaluated. As this 
means a formal validation process, it is not surprising that robust 
anchor-based evaluations of minimal clinically important 
differences are lacking even for the most frequently used QoL 
instruments. A more common approach is to evaluate changes in QoL 
with regard to observed statistical distributions, often called the 
distribution method. For example, a change in QoL exceeding a 
specified number of standard deviations is deemed to be the 
minimal clinically important difference. Many might argue that using 
something like a standard deviation as a yardstick for assessing a 
change in QoL is not too different from evaluating a level of 
statistical significance. Given that a major objective of trying to 
establish the minimal clinically important difference is to allow 
separation of clinical significance from statistical significance, using 
a statistical approach for an essentially clinical question appears to 
be a logical fallacy. 
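
The distribution method described above can be sketched as follows. This is only an illustration: the multiplier of 0.5 standard deviations is an assumed, commonly quoted choice rather than a value given in this chapter, and the baseline scores are invented.

```python
from statistics import stdev
from typing import Sequence


def distribution_based_threshold(baseline_scores: Sequence[float],
                                 sd_multiplier: float = 0.5) -> float:
    """Threshold = chosen number of standard deviations of the baseline scores.

    sd_multiplier is an assumption for illustration; the text only says
    "a specified number of standard deviations".
    """
    if len(baseline_scores) < 2:
        raise ValueError("need at least two scores to estimate a standard deviation")
    return sd_multiplier * stdev(baseline_scores)


def exceeds_threshold(change: float, threshold: float) -> bool:
    """True when the absolute change in score exceeds the threshold."""
    return abs(change) > threshold


# Invented baseline QoL scores for five patients.
scores = [20.0, 30.0, 40.0, 30.0, 30.0]
threshold = distribution_based_threshold(scores)
print(f"threshold = {threshold:.2f} points")
print(exceeds_threshold(4.2, threshold), exceeds_threshold(2.8, threshold))
```

Note that the threshold depends entirely on the spread of the sample and on the chosen multiplier, which is exactly the objection raised above: nothing in the calculation reflects what patients consider important.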

An interesting example of the challenges of assessing the 
clinical importance of changes in QoL scores comes from the Trial to 
Reduce Cardiovascular Events With Aranesp Therapy (TREAT) 
study [4]. In this study of patients with diabetes, chronic kidney 
disease, and anemia, 4,038 patients were randomly assigned to 
darbepoetin alfa, with a target hemoglobin level of 13 g/dL, or to 
placebo, with salvage darbepoetin alfa for hemoglobin levels under 
9.0 g/dL. No differences in the primary end points (composites of 
death or cardiovascular events and death or end-stage renal disease) 
were seen in the study. The main prespecified patient-reported 
outcome was the change at week 25 in the Functional Assessment of 
Cancer Therapy-Fatigue (FACT-Fatigue) score (on which scores range 
from 0 to 52, with higher scores indicating less fatigue). Among 
patients with both baseline and week-25 scores, the baseline score 
was 30.2 in the 1,762 of 2,012 (87.6 %) patients assigned to 
darbepoetin alfa and 30.4 in the 1,769 of 2,026 (87.3 %) patients 
assigned to placebo. There was a greater degree of improvement in the 
mean (±S.D.) score in the darbepoetin alfa group than in the placebo 
group (an increase of 4.2 ± 10.5 points vs. 2.8 ± 10.3 points, 
P < 0.001 for between-group changes). An increase of three or 
more points (“considered to be a clinically meaningful 
improvement”) occurred in 963 of 1,762 patients assigned to darbepoetin 
alfa (54.7 %) and 875 of 1,769 patients assigned to placebo 
(49.5 %) (P = 0.002). Though the latter comparison was of 
subsidiary importance in the trial, many interpreted the trial as having 
shown a statistically significant difference of borderline clinical 
importance, not least because 19 patients (100/(54.7 - 49.5)) 
would have to be treated for 1 more patient to achieve a change in 
FACT-Fatigue score that was clinically meaningful [5]. 
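
The number needed to treat quoted above follows directly from the responder counts reported for the two arms. The sketch below reproduces that arithmetic; the function name is illustrative, and the only inputs are the counts given in the text.

```python
def number_needed_to_treat(responders_treated: int, n_treated: int,
                           responders_control: int, n_control: int) -> float:
    """NNT = 1 / (responder proportion, treated - responder proportion, control)."""
    p_treated = responders_treated / n_treated
    p_control = responders_control / n_control
    absolute_difference = p_treated - p_control
    if absolute_difference <= 0:
        raise ValueError("treated arm shows no excess of responders")
    return 1.0 / absolute_difference


# Responders (increase of three or more FACT-Fatigue points) as reported in the text.
p_darbepoetin = 963 / 1762
p_placebo = 875 / 1769
nnt = number_needed_to_treat(963, 1762, 875, 1769)
print(f"{100 * p_darbepoetin:.1f} % vs {100 * p_placebo:.1f} %, NNT = {nnt:.1f}")
```

The unrounded result is a little over 19, in line with the figure of 19 patients obtained from 100/(54.7 - 49.5).
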
Given that fatigue is such a debilitating problem in patients 
with chronic kidney disease and anemia, and that defining a 
clinically important change in FACT-Fatigue score is difficult, two 
questions immediately present themselves: What is the provenance of 
the ≥3 points change requirement to be considered clinically 
meaningful? Are data from within the trial available that could allow 
patients to judge what a 1.4-point difference in change scores means 
to them? Regarding the first question, the evidence for the three-point 
requirement for the FACT-Fatigue scale comes from retrospective 
analyses from a heterogeneous group of studies in patients with 
cancer, including patients that participated in a nonrandomized 
validation study of the FACT-An scale; patients from a 
nonrandomized, observational study of chemotherapy-induced fatigue; 
and patients from a community-based clinical trial of an 
intervention for anemia in patients with cancer [6]. While the use of the 
description “anchor” in the study might suggest to an unwary 
reader that patient input was sought in determining clinically 
important differences in QoL scales, this was not uniformly the 
case, as the three anchors consisted of a blood test (hemoglobin 
level), a physician-based assessment of overall functional status (the 
Karnofsky score), and an evaluation of whether the Fatigue 
subscale changed in parallel with the overall FACT-An scale within the 
same patient. It seems hard, then, to conclude that ≥3 is a valid 
estimate of the minimally important clinical difference for the 
FACT-Fatigue scale in patients with anemia, diabetes, and chronic 
kidney disease. 

Is it safe to conclude that a 1.4-point difference between the 
two treatment arms should be discarded as clinically meaningless? 
A subsequent report from the TREAT investigators was instructive 
in this regard [7]. This study included regression coefficients for 
FACT-Fatigue scores as a time-integrated outcome, measured at 
weeks 25, 49, and 97 of the study. In this patient group, each 
additional year of age was associated with a decline in FACT-Fatigue 
score of 0.073 points. Thus, a treatment difference of 1.4 points 
would be equivalent to 1.4/0.073 or 19.2 additional years of age. 
Using a similar approach, the treatment effect in the TREAT trial 
exceeded the fatigue associations of having recreation activity 
classed as heavy or medium (vs. none or light), and the presence of 
overt pulmonary disease or cardiovascular disease (Table 1). In this 
framework, a change of 1.4 points on the FACT-Fatigue score may 
not be trivial after all. Thus, observational models of changes in 
key QoL parameters may be very useful for gauging the clinical 
importance of treatment effects in randomized trials. 
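
The comparison above amounts to re-expressing the 1.4-point treatment difference in units of other covariates. A minimal sketch, using the coefficients reported in Table 1 (the function and variable names are illustrative, not part of the original analysis):

```python
def covariate_equivalent(treatment_effect: float, coefficient_per_unit: float) -> float:
    """Units of a covariate whose association matches the treatment effect in size."""
    if coefficient_per_unit == 0:
        raise ZeroDivisionError("coefficient must be non-zero")
    return abs(treatment_effect / coefficient_per_unit)


TREATMENT_DIFFERENCE = 1.4  # between-arm difference in FACT-Fatigue change (points)

# Regression estimates taken from Table 1 (points of FACT-Fatigue change).
age_per_year = -0.073
binary_covariates = {
    "Recreation activity (heavy/medium versus other)": 1.066,
    "Pulmonary disease": -1.299,
    "History of cardiovascular disease": -0.990,
}

years_of_age = covariate_equivalent(TREATMENT_DIFFERENCE, age_per_year)
print(f"1.4 points is comparable to {years_of_age:.1f} years of age")

# Covariates whose association with fatigue is smaller in size than 1.4 points.
smaller = [name for name, estimate in binary_covariates.items()
           if abs(estimate) < TREATMENT_DIFFERENCE]
for name in smaller:
    print(f"treatment difference exceeds: {name}")
```

This is an interpretive aid only: the coefficients describe associations in an observational model, so the age equivalence should not be read as a causal statement.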

In conclusion, measurement of QoL in randomized trials is 
important, but care must be taken to prespecify outcomes, to 
develop strategies to account for missing data and to understand 
the clinical importance of statistically significant changes in QoL. 

Table 1 

Multivariate repeated measure model for FACT-Fatigue changes at 25, 49, and 97 weeks (N = 3,531) in the 
TREAT trial of darbepoetin therapy in patients with CKD, diabetes, and anemia [7] 

FACT-Fatigue change (n = 3,531) 

Model covariates                                      Estimates (SE)    P 

Randomization to darbepoetin                           1.001 (0.402)    0.013 
History of cardiovascular disease                     -0.990 (0.299)    <0.001 
Baseline FACT-Fatigue score (per 1 unit increase)     -0.504 (0.012)    <0.001 

Baseline factors 
  Race (black versus white)                            1.108 (0.382)    0.004 
  Recreation activity (heavy/medium versus other)      1.066 (0.320)    <0.001 
  Age (per 1 year increase)                           -0.073 (0.016)    <0.001 
  Pulmonary disease                                   -1.299 (0.329)    <0.001 
  Any nonsedentary job activity                        0.933 (0.338)    0.006 
  Baseline diastolic BP (per 1 mmHg increase)         -0.044 (0.015)    0.004 
  Body mass index (per 1 kg/m^2 increase)             -0.045 (0.021)    0.030 
  Baseline white blood cells (10^9/L)                 -0.191 (0.070)    0.006 
  Baseline triglycerides (per 10 mg/dL)               -0.070 (0.019)    <0.001 
  Baseline potassium (mmol/L)                          0.535 (0.234)    0.023 
  Baseline serum ferritin (per 10 µg/L)                0.012 (0.006)    0.036 
  History of diabetic nephropathy                     -0.570 (0.293)    0.052 
  History of atrial fibrillation                      -0.999 (0.480)    0.038 
  Known duration of diabetes (per 1 month increase)   -0.003 (0.001)    0.013 

Postrandomization factors 
  Interim stroke                                      -5.040 (1.300)    <0.001 
  Number of hospitalizations                          -1.007 (0.128)    <0.001 
  Any hemoglobin <9 g/dL status                       -1.100 (0.314)    <0.001 

Estimates were from repeated-measure model adjusted for baseline FACT-Fatigue scores, stratification factors, treatment 
groups, covariates listed in Table 1, and postrandomization factors (i.e., interim heart failure event, number of days 
transfused, interim myocardial ischemia/infarctions, and development of ESRD). n Number of subjects in PRO 
FACT-Fatigue analysis set. With permission from Lewis EF et al. (2011). Darbepoetin alfa impact on health status in diabetes 
patients with kidney disease: a randomized trial. Clin J Am Soc Nephrol 6:845-855 







Robert N. Foley and Patrick S. Parfrey 


References 

1. McHorney CA, Ware JE Jr, Raczek AE (1993) 
The MOS 36-Item Short-Form Health Survey 
(SF-36): II. Psychometric and clinical tests of 
validity in measuring physical and mental health 
constructs. Med Care 31:247-263 

2. Fayers PM, Machin D (eds) (2007) Quality of 
life: the assessment, analysis and interpretation 
of patient-reported outcomes, 2nd edn. Wiley, 
Chichester, England 

3. Hays RD, Woolley JM (2000) The concept of 
clinically meaningful difference in health-related 
QoL research. How meaningful is it? 
Pharmacoeconomics 18:419-423 

4. Pfeffer MA, Burdmann EA, Chen CY, Cooper 
ME, de Zeeuw D, Eckardt KU, Feyzi JM, 
Ivanovich P, Kewalramani R, Levey AS, Lewis 
EF, McGill JB, McMurray JJ, Parfrey P, Parving 
HH, Remuzzi G, Singh AK, Solomon SD, 
Toto R, TREAT Investigators (2009) A trial of 
darbepoetin alfa in type 2 diabetes and chronic 
kidney disease. N Engl J Med 361:2019-2032 

5. Marsden PA (2009) Treatment of anemia in 
chronic kidney disease - strategies based on 
evidence. N Engl J Med 361:2089-2090 

6. Cella D, Eton DT, Lai JS, Peterman AH, Merkel 
DE (2002) Combining anchor and 
distribution-based methods to derive minimal clinically 
important differences on the Functional 
Assessment of Cancer Therapy (FACT) anemia 
and fatigue scales. J Pain Symptom Manage 
24:547-561 

7. Lewis EF, Pfeffer MA, Feng A, Uno H, 
McMurray JJ, Toto R, Gandra SR, Solomon SD, 
Moustafa M, Macdougall IC, Locatelli F, Parfrey 
PS, TREAT Investigators (2011) Darbepoetin 
alfa impact on health status in diabetes patients 
with kidney disease: a randomized trial. Clin J Am 
Soc Nephrol 6:845-855 



Chapter 16 


Randomized Controlled Trials: Planning, Monitoring, 
and Execution 

Elizabeth Hatfield, Elizabeth Dicks, and Patrick S. Parfrey 

Abstract 

Large integrated multidisciplinary teams have become recognized as an efficient means by which to drive 
innovation and discovery in clinical research. This chapter describes how to plan, budget and fund these 
large studies and execute the studies with well-designed governance and monitoring protocols in place, to 
efficiently manage the large, often dispersed teams involved. Sources of funding are identified, budget 
development, justification, reporting, financial governance and accountability are described, in addition to 
the creation and management of the multidisciplinary team that will implement the research plan. 

Key words Clinical research, Randomized controlled trials, Management, Multisite, Multidisciplinary 
teams, Budgeting, Funding 


1 Introduction 


Evidence-based research is the primary mechanism utilized to 
inform health policy, improve health, and strengthen health care 
through a focused system of translational health research. 
Successfully funded proposals have four key elements: (1) A clearly 
stated research question, (2) Strategic alignment of the research 
question with the mission of the funding agency, (3) A clear 
description of a well-thought-out experimental design, and (4) A 
cost-effective budget that demonstrates maximum allocative and technical 
efficiency of the resources required [1, 2]. 


2 Alignment of the Research Question with Agency Research Themes 

One of the key elements of successful proposals is the alignment of 
research questions with agency funding priorities. In Canada, four 
national health priorities have emerged and are defined as the four 
pillars of health research. These include biomedical, clinical, health 
systems, and population health and are to be investigated using a 
problem-based multidisciplinary approach. In addition to the 
national priorities, local and regional strategic health plans should 
not be overlooked, as they will have their own unique set of 
priorities in addition to those defined at the national level. These will 
require information through evidence-based research in areas such 
as wait times and resource utilization and present opportunities for 
funded research. 


3 Sources of Funding for Clinical Research 

Funding sources vary from country to country, but high-income 
countries usually have multiple sources including government, 
foundations, private industry, and professional organizations. In 
Canada, for example, these include the following: 

3.1 Federal 

The Canadian Institutes of Health Research (CIHR), the Natural 
Sciences and Engineering Research Council of Canada (NSERC) 
Collaborative Health Research Projects (CHRP) Program, and 
Genome Canada fund clinical research. The Canadian Institutes of 
Health Research is the government of Canada’s health research 
funding agency and reports to Parliament through the Minister of 
Health. It comprises 13 institutes, each with its own 
scientific director and advisory board. The mandate of CIHR is “to 
excel, according to internationally accepted standards of scientific 
excellence, in the creation of new knowledge and its translation 
into improved health for Canadians, more effective health services 
and products and a strengthened Canadian health care system” 
[3]. Across the 13 institutes, CIHR has a research budget of 
approximately $1 billion per annum and leverages approximately 
$100 million from partner agencies per annum. These combined 
sources fund training, salary, equipment, and operating grants. In 
2013-2014, approximately 50 % of CIHR’s budget, $465.8M, was 
awarded for open operating grants [4], which include 
randomized controlled trials. Additionally, a number of new signature 
initiatives were launched, including Personalized Medicine, Patient 
Oriented Research, Community Based Primary Healthcare, and 
the Canadian Epigenetics, Environment and Health Research 
Consortium [5]. Strategic initiatives launched included the 
Canadian Longitudinal Study on Aging, the Drug Safety and 
Effectiveness Network, and the Strategic Training Initiative in Health 
Research [3, 6]. By comparison, in the United States the National 
Institutes of Health invested $30.1 billion in medical research in 
2014 [7]. 

NSERC reports to Parliament through the Minister of 
Industry. NSERC has invested over $6 billion in basic research and 
university-industry projects in the past decade. It has an annual 
budget of $1.1 billion and is the largest funder of science and 
engineering research in Canada [8]. It has established the CHRP 
program in partnership with CIHR. This program funds 
interdisciplinary collaborative projects that benefit the health of Canadians 
through the translation of the research outcomes to health policy. 
In 2012-2013, NSERC invested $170.4M in Health and Related 
Life Sciences Technologies [8]. 

Genome Canada is the principal proteomics and genomics 
funding center in Canada and has received $700 million from the 
government of Canada over the last decade for investment in 
research. Genome Canada’s vision is “To harness the 
transformative power of genomics to deliver benefits to Canadians.” 
Implementation of this vision is achieved by “(1) Connecting ideas 
and people across public and private sectors to find new uses and 
applications for genomics; (2) Investing in large-scale science and 
technology to fuel innovation; and (3) translating discoveries into 
applications to maximize impact across all sectors” [9]. Genome 
Canada invests in large-scale multidisciplinary research projects 
through a system of international peer review. In 2013-2014 it 
invested $47.6M in support of research projects, $15.7M for S&T 
Innovation Centres, and $4.8M for base funding of the regional 
Genome Centres [10]. As part of its mandate it must raise 50 % of 
the funding required for any project from the investment of 
partners in the public, not-for-profit, and private sectors in Canada and 
abroad. In addition to research projects, Genome Canada also 
continues to build national capacity with leading edge technical 
platforms which facilitate the design and implementation of more 
efficient genomic and proteomic methodologies. 

3.2 Foundations (Public and Private) 

Foundations include charitable or not-for-profit agencies. Private 
foundations are usually funded by one major source: an individual, 
a family, or a corporation. Public foundations are usually funded 
through multiple sources, which include private foundations, 
individuals, and government agencies [11]. There are many 
foundations which provide grants for health research, and often their 
funding is directed toward specific research priorities. A web search 
lists over 60,000 foundations in the United States alone 
which provide research grants. 

3.3 Private Industry 

Private industry funders are primarily pharmaceutical companies 
and equipment manufacturers. Often these companies will provide 
unrestricted grants in response to specific requests for funding. In 
addition, they also provide contractual funding to academic 
researchers to conduct randomized controlled trials (RCT) as part of 
multinational networks. 

3.4 Professional Organizations 

Professional organizations often provide small amounts of funding, 
often between $10,000 and $45,000, that can be used as seed funding 
to develop a research question and conduct preliminary work. 

A number of mechanisms exist to track calls for proposals. 
Almost all agencies publish calls for proposals on the web and have 
online searchable databases which list funding opportunities and 
identify the research themes, relevant procedures, forms, and 
deadlines for submission of research proposals for funding. Foundations 
and corporations also publish annual reports and newsletters which 
include outlines of funded projects. Professional associations 
provide information, via the web and periodic newsletters, on funding 
opportunities within their organization and on those available 
through major funding agencies and foundations. In addition, 
funded peers within a professional organization can be a valuable 
source of information on potential sources of funding. 
A review of the professional literature can also help anticipate new 
research trends that will be funded. Most academic institutions, 
research hospitals, and other research institutions provide an 
annual web-based public report of funded projects which lists 
funding sources and research projects. Memorial University of 
Newfoundland, for example, provides such a report annually 
(http://www.mun.ca/research/publications/matters.php). 


4 Budget Development and Justification 

A well-thought-out budget is a critical component of a successfully 
funded project and it must demonstrate the most cost-efficient 
means to investigate the research question. Funding agencies 
require a carefully detailed budget and justification summarizing 
costs and describing why each item in the budget is needed to 
complete the work outlined in the proposal and the time frame 
over which the item will be required. In addition, the justification 
must clearly articulate how the calculations were arrived at for each 
item. All budgeted items should be adjusted for inflation within 
the limits of the funding agency guidelines. Agencies often provide 
a budget template together with a list of guidelines for eligible 
expenses. These may vary by funding program within an agency. 
In addition to the requirements and procedures of the funding 
agency, the investigator should also be aware of his or her 
institutional policies and guidelines when preparing the budget. 

4.1 Direct Costs 

Budgets have a number of basic elements in common. These include 
direct costs, which are those required to complete the work 
outlined in the proposal, such as salaries for personnel, 
consumables, services, equipment, travel, and renovations. 

4.1.1 Personnel 

The number and types of personnel, and their time allocation 
calculated in full-time equivalents, must be identified for each year. 
Salary and fringe benefit costs for each type of personnel 
required are available from the human resources division of the 
investigator’s institution. Where multiple institutions are involved, this 
becomes more complex but is required, as rates differ across 
institutions. Fringe benefit rates, which include group medical, life, and 
disability insurance, may account for between 20 and 25 % of salary 
costs depending on the type of personnel. It is also important to be 
aware of the status of professional bargaining agreements, as renewed 
agreements are often applied retroactively. This should be factored 
into budget costs for personnel, particularly since personnel costs 
may represent between 65 and 75 % of the total operating budget. 

For each person to be hired, or for each type of personnel 
required, the justification should include, in addition to salary costs, 
the title the person will have on the project, their name (if known), 
degree(s), the experience and expertise the individual brings to the 
project, as well as a description of the responsibilities of the position. 
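
As a minimal sketch of the personnel arithmetic described above, the annual cost of a position can be computed as salary times full-time equivalents, plus fringe benefits. The position titles, salaries, and fringe rates below are hypothetical; real figures would come from the institution's human resources division.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Position:
    title: str
    annual_salary: float  # full-time salary for one year
    fte: float            # time allocation in full-time equivalents
    fringe_rate: float    # fringe benefits as a fraction of salary (e.g., 0.20-0.25)


def annual_cost(position: Position) -> float:
    """Salary charged to the project plus fringe benefits on that salary."""
    salary_charged = position.annual_salary * position.fte
    return salary_charged * (1.0 + position.fringe_rate)


# Hypothetical positions for illustration only.
positions: List[Position] = [
    Position("Research nurse", 80_000.0, 0.5, 0.25),
    Position("Data manager", 60_000.0, 1.0, 0.20),
]

personnel_total = sum(annual_cost(p) for p in positions)
for p in positions:
    print(f"{p.title}: {annual_cost(p):,.0f}")
print(f"Total personnel cost per year: {personnel_total:,.0f}")
```

Keeping the fringe rate on each position, rather than applying one global rate, reflects the point that rates differ by type of personnel and by institution.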

4.1.2 Services 

Many projects include a budget item for services, for example costs 
for services not part of usual clinical care, such as a specific blood 
panel or EKG stipulated in the protocol. The fee for service is 
negotiated in writing with the hospital and the mechanism is put in 
place for billing the project. 

4.1.3 Consumables 

Consumables may represent 20-25 % of an operating budget and 
include lab and office consumables required to do the work. 
Individual expenses for consumables greater than $1,000 should 
be justified clearly in the budget. 

4.1.4 Equipment 

Where lab equipment is an allowable expense, each piece of 
equipment should be justified with a full explanation of what it will be used 
for, and vendor quotes should be supplied, outlining costs, taxes, 
shipping, and maintenance agreements. The investigator must also 
take into account whether basic infrastructure equipment is required, 
such as computers and software, desks and chairs, and telephones, and 
identify what is available in existing office equipment. There are often 
associated institutional guidelines regarding the purchase of 
equipment, with requirements for tendering for single equipment purchases 
of $10,000 or greater, which may vary by institution. 

4.1.5 Travel 

Travel is often required for investigators and trainees for 
dissemination of results to their peers at academic meetings, and also for 
the provision of research results back to the community of study 
participants and their families. This kind of travel may include 
costs for airfare, taxis, overnight accommodation, and per 
diem rates. In addition, patient travel costs incurred to participate 
in the study should be included where necessary. Institutional 
guidelines will determine the rates for per diems, allowable taxi 
rates, gas, or mileage. 

Depending on the funding agency, budgets for minor renovations, 
such as improving existing lab space to accommodate required 
equipment, are allowable expenses. These kinds of budgets often 
have a maximum budget threshold defined by the agency. Some 
types of funding programs are designed entirely for space and large 
infrastructure proposals, such as the Canada Foundation for 
Innovation Research Hospital fund. 

4.2 Indirect Costs 

Indirect costs are associated with infrastructure and include 
overhead for space and equipment provided by the institutions. These 
costs are often calculated as a percentage of the total direct costs. 
Indirect costs are negotiated in advance between the institution 
and funder, and the rates are often standardized. 
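
The relationship between direct costs, inflation adjustment, and indirect costs can be sketched as below. The inflation rate of 2 % and the indirect cost rate of 40 % are assumptions for illustration only; actual rates are set by agency guidelines and by negotiation between institution and funder.

```python
from typing import List, Tuple


def project_budget(direct_costs_year1: float, years: int,
                   inflation_rate: float,
                   indirect_rate: float) -> List[Tuple[int, float, float, float]]:
    """Return (year, direct, indirect, total) rows for a multi-year budget.

    Direct costs are inflated annually from the year-1 figure; indirect
    costs are taken as a percentage of that year's total direct costs.
    """
    rows = []
    for year in range(years):
        direct = direct_costs_year1 * (1.0 + inflation_rate) ** year
        indirect = direct * indirect_rate
        rows.append((year + 1, direct, indirect, direct + indirect))
    return rows


# Hypothetical inputs: 100,000 direct costs in year 1, 3-year grant.
budget_rows = project_budget(100_000.0, years=3, inflation_rate=0.02, indirect_rate=0.40)
grand_total = sum(total for _, _, _, total in budget_rows)

for year, direct, indirect, total in budget_rows:
    print(f"Year {year}: direct {direct:,.0f}  indirect {indirect:,.0f}  total {total:,.0f}")
print(f"Total request: {grand_total:,.0f}")
```

Laying the calculation out year by year also serves the justification requirement, since reviewers can see how each annual figure was derived.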


5 Reporting and Governance 

Notification of grant approval usually takes between 3 and 
6 months from submission of the full proposal. The investigator 
then receives a notice of award together with reviewers’ 
comments. This is generally copied to the institution’s Office of Research. 
The notice of award will outline the approved final budget and, if it is 
reduced from the original request, the investigator will be asked to 
adjust the budget allocation appropriately and outline whether the 
reduction will negatively affect the scope of work proposed. 
Upon acceptance of the award by the investigator, the funding 
agency will release the funding to the institution with certain 
provisions in place, which are covered in a memorandum of 
understanding between the institution and the funding agency. In the case of 
CIHR these include the following: (1) Prior to release of funding to 
the investigator, the institution must ensure that the investigator has 
full ethics approval to conduct the study and that the required 
infrastructure is in place. (2) The institution must ensure that 
expenditures allocated to the grant meet eligibility criteria as defined by the 
funding agency and by the institution. (3) The institution must 
provide a financial statement of account to the funding agency 1 month 
after the fiscal year ends, or by April 30th, for each year of the grant. 
Funding is released in annual allocations as outlined in the budget. 

The investigators and team need to familiarize themselves with 
the budget, be able to identify eligible and ineligible expenses, 
and understand the reporting and accounting spreadsheet formats 
utilized to track and report spending activities and the nature of 
financial accountability. 

Generally, the funding agencies provide one additional year to 
allow utilization of small amounts of surplus funding remaining at 
the conclusion of the funding period to allow a wind-down phase. 
The investigators may have further reporting responsibilities to the 
funding agency, and this may require the co-submission of scientific 
progress reports at the end of each fiscal year and upon completion 
of the project. 

For projects supported by multiple sources, reporting and 
governance regulations may be much more demanding. In the case 
of Genome Canada, a collaborative research agreement (CRA) is 
signed between the institutions and the regional Genome Centre. 
The CRA, in addition to outlining those details above, requires 
that the investigator and institutions provide very detailed financial 
statements of accounts to Genome Canada on a quarterly basis as 
well as details for each of the co-funders’ budgeted expenses. 
A detailed explanation of variance is also required for variances of 
15 % or greater from the original budget defined for each activity. 
Scientific reports outlining progress toward milestones are also 
required on a quarterly basis. Where multiple sites are involved in 
such a project, it is wise to invest in a project manager who can 
coordinate these activities. Genome Canada funding is released by 
the regional Genome Centre on a quarterly basis following review 
of the reports. 
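
A project manager preparing a quarterly report could flag activities that need a variance explanation with a check like the one below. This is a sketch under stated assumptions: the activity names and amounts are invented, and the code only applies the 15 % threshold mentioned above; it does not reproduce any official Genome Canada reporting format.

```python
from typing import Dict


def variances_to_explain(budgeted: Dict[str, float], actual: Dict[str, float],
                         threshold: float = 0.15) -> Dict[str, float]:
    """Return activities whose spending deviates from budget by the threshold or more.

    Variance is expressed as a fraction of the budgeted amount
    (positive = overspent, negative = underspent).
    """
    flagged = {}
    for activity, planned in budgeted.items():
        if planned <= 0:
            raise ValueError(f"budgeted amount for {activity!r} must be positive")
        spent = actual.get(activity, 0.0)
        variance = (spent - planned) / planned
        if abs(variance) >= threshold:
            flagged[activity] = variance
    return flagged


# Hypothetical quarterly figures for three activities.
budgeted = {"Recruitment": 40_000.0, "Genotyping": 100_000.0, "Data management": 20_000.0}
actual = {"Recruitment": 47_000.0, "Genotyping": 95_000.0, "Data management": 16_000.0}

flagged = variances_to_explain(budgeted, actual)
for activity, variance in flagged.items():
    print(f"{activity}: {variance:+.1%} variance requires explanation")
```

In this example, recruitment (17.5 % over) and data management (20 % under) would be flagged, while genotyping (5 % under) would not.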


6 Management of Clinical Research Projects 


6.1 Overview 

Interdisciplinary or multidisciplinary studies have emerged as a 
means to fuel innovation in research and facilitate scientific 
progress [11-13]. Once a project is funded, the research plan has been 
clearly articulated and the research team is defined. Key to 
implementing the plan successfully with large teams is the efficient 
communication of the research objectives and the budget that is meant 
to support these. This should be communicated to the entire team, 
including staff, trainees, collaborators, and co-investigators, and each 
member of the team must understand what their individual and 
collective responsibilities are. The following sections describe how 
productive teams can be built and managed effectively to ensure 
the success of the research project. 

6.2 Leadership and Organization 

Multisite projects require an overall team leader, or principal 
investigator, who directs the project and is responsible for the 
project. This individual is usually someone who has a proven 
track record of leading these kinds of studies and has enough 
time to commit to the study. Large multidisciplinary and 
multisite projects integrate discipline teams, each with its own team 
leader. In addition to the principal investigator (team leader), 
each individual site requires a co-investigator who directs site 
operations. 

A clearly defined governance framework is invaluable to the 
seamless implementation and ongoing oversight of a multisite 
study. It comprises the following components: 

6.2.1 Executive Steering Committee 

Multisite studies require an executive steering committee, which is 
composed of the principal investigator and co-investigators [14] 
and is responsible for management of the project. The committee 
oversees the design, execution, and analysis of the study, and reports 
and communicates the study results in conjunction with the study 
sponsors. 

6.2.2 Scientific Advisory Board 

A scientific advisory board with relevant areas of expertise is often 
put in place to consult with the steering committee on overcoming 
challenges and barriers to achievement of milestones when 
necessary. 

6.2.3 Clinical Endpoint Committee 

For randomized controlled trials, a Clinical Endpoint Committee is 
put in place to adjudicate achievement of study endpoints in an 
unbiased and consistent manner according to prespecified 
endpoint criteria outlined in the trial protocol. 

6.2.4 Data Safety and Monitoring Committee 

Most funding agencies, including NIH and CIHR, as well as national 
health regulatory organizations, including the FDA and Health 
Canada, require the inclusion of a Data Safety and Monitoring 
Committee (DSMC) for interventional studies (e.g., RCTs) and 
also for some observational studies (OSMC). For RCTs, the 
DSMC is responsible for monitoring the quality of the data and 
evaluating the efficacy and safety of the study intervention, and 
it will make recommendations to the Executive Steering Committee 
regarding interim analysis and early termination. 

6.2.5 Independent Biostatistics Group 

For multinational studies, an independent biostatistics group is 
often put in place to support the DSMC through independent 
analysis of safety and efficacy study data. 

6.2.6 Central Laboratory and Biobanking 

Where biospecimens are collected, a central laboratory is required 
for biospecimen management, analysis, and biobanking as defined 
for each of these components under the study protocol. 

6.2.7 Study Sites 

The study sites are responsible for recruitment of participants and 
ethical conduct of the study in accordance with the study protocol 
and all applicable guidelines. 

6.2.8 Data Management 

For large multisite studies, a data management center (DMC) is 
recommended to integrate data collected and ensure consistent 
application of prespecified ISO data standards across sites, as well as to 
ensure ongoing data quality and control. The data management center 
will ensure that confidentiality and privacy of the data are maintained 
as defined by national and provincial or state legislation. It does so 
by providing the sites with appropriate data collection and transfer 
protocols for collecting and entering the data on the central platform 
and for transferring data collection documents, such as abstraction 
forms, to the DMC for quality control and backup. 

Data transfer agreements should be put in place as part of the 
site agreements, defining ownership of the data, standardized data 
entry protocols including a comprehensive data dictionary, and 
protocols for secure transfer of digital and hardcopy data. 

6.2.9 Interactive Voice Response System 

Randomization of study participants, assignment of investigational 
product dosing, and drug supply management are handled by an 
automated interactive voice response system. 

In addition to this, a study coordinator or project manager is
required who will be responsible for establishing and maintaining
ongoing communications between the teams and across sites, and
who will have responsibility for monitoring and reporting on
progress, ensuring adherence of the sites to the study protocol and
achievement of milestones, as well as identifying challenges and
barriers to collaboration between principals. Each site will also
require its own site coordinator responsible for implementing
site operations and communicating challenges to the study
coordinator.

Hiring team members defined by the roles and mix of skills as
outlined in the project proposal must be timely. Successful
implementation of the research plan requires an effective
communication strategy during study startup, which must be led by
the executive steering committee of the study. Such a communication
strategy should include focused team meetings and a shared network
where tasks are assigned, completion of milestones is monitored
and acknowledged, and communication between team members is
encouraged. Microsoft’s SharePoint is a good example of
collaborative software that allows sharing of data files by team
members via secure web access.

Don’t discount the importance of support staff. Clinic-based 
staff such as nurses or attendants may be the first point of contact 
to help recruit patients or to notify investigators when a participant 
is admitted to hospital. Take the time to present your study and 
orientate these groups to the project. Although they are not paid 
team members, oftentimes a small contribution to an education 
fund or the donation of a particular book to their unit will enhance 
collaboration. 

A successful and productive team has to have a collective
commitment to a common goal inspired by its leadership to
overcome challenges. Working together, a team demonstrates a shared
leadership role, individual and mutual accountability, a clearly
defined objective that the team delivers, a sense of shared
commitment and purpose, collective work products, a work environment
that encourages open-ended discussion and active problem
solving, and measurement of performance through assessment of
collective work products [15].

6.3 Distribution of Funding

Following the disbursement of funding from the funding agency
to the lead institution and ethics approval, funding is released to
the principal investigator. An account is established for the
investigator, to which eligible expenditures outlined in the
budget can be charged. Where multiple sites are involved, individual
site budgets must be negotiated, based on economies of scale, during
the grant writing phase. The lead institution provides the
collaborating institutions with separate inter-institutional
agreements or subcontracts which outline the site budgets, the scope
of work, the reporting required, authorship guidelines, and data
transfer clauses. Participating institutions’ legal, administrative,
and risk management offices will review the agreements and, if
needed, recommend changes based on institutional requirements. When
this has been completed, the institutional signing authorities, the
principal investigator, and the site investigators may sign the
agreements. When the lead institution receives confirmation of
ethics approvals, funding is released to the sites.

6.4 Training and Orientation

During study startup each component leader and staff member
should receive a copy of the study binder, which includes
the study protocol, a copy of ethics approval, a copy of the budget,
and copies of any other associated documentation, including data
abstraction and collection forms and study questionnaires. The
written protocol should be thought of as the Standard Operating
Procedures for the study. Any subsequent protocol changes involving
study participants must be reviewed and approved by the site
research ethics boards prior to their implementation. The revised
protocol must then be circulated to the entire team, and a dated
revision of the protocol (with ethics approval) archived to a secure
site. It is also useful to include a brief statement regarding why the
changes were made, particularly for studies that span several years.
This documentation matters because the people who were involved in
changing the protocol may move on, taking the rationale for the
changes with them. It is also important to keep a documented record
of protocol changes because the changes will need to be identified
in the methods section as manuscripts are written, and because the
record is necessary for the orientation of new staff.

6.5 Ethics

As discussed in another chapter, all research involving human
investigation requires application to an ethics board for approval.
Multiple site involvement is complex, and timelines for submission
must be met to ensure all sites start up in as timely a manner as
possible. A standardized ethics submission should be prepared by
the study coordinator and distributed to the sites with the original
protocol and any subsequent amendments. In Canada, most ethics
boards have similar requirements, and if the coordinating center of
a multisite project receives approval, in all likelihood the sites will
receive approval as well from their own ethics boards. However,
this is not always the case, and it is incumbent on each site
coordinator to ensure that the specific guidelines for ethics
submissions at each site are followed.

The study coordinator and/or project manager is responsible
for maintaining copies of all ethics approvals submitted across sites
to ensure institutional and funding agency guidelines are met. In
addition, each site coordinator must maintain copies of their own
site ethics submissions as part of routine study documentation.

For RCTs, many ethics boards also require registration of the
trial on a publicly available clinical trials registry such as
ClinicalTrials.gov. Details of the trial including the sponsorship, list
of investigators, purpose, population, sample size, outcomes, and
publications are listed. The site documentation for the trial is
maintained by the trial coordinator over the life of the study.

6.6 Memorandum of Understanding (MOU)

Verbal agreements between study personnel and departments or
institutions may be expedient in getting the project started;
however, failure to have a written MOU may prove disastrous later as
the project evolves. An MOU may help when people change
positions (from intellectual and career perspectives), which may put
the project at risk. Therefore, to ensure that what is negotiated in
the early stages of the project continues throughout its life, an MOU
should be negotiated. The MOU does not need to be a legal contract
involving lawyers, but simple documentation between the team
and the agreeing party or institution. The MOU will need to
describe what has been agreed upon, detailing the contribution of
both sides and the duration. If monetary stipulations are also
agreed upon at the beginning of the study, it is better to have them
documented in an agreement. Also, policies concerning authorship
should be recorded at the start of the project.

6.7 Execution of the Research Plan

The implementation phase of a study may take time, but the steps
taken at this phase should enhance the productivity of the group
and assure completion of the project. This phase of the study is also
where the team needs to evaluate whether a pilot study is advisable.
Although a pilot study entails added work, time, and expense, the
information returned may be invaluable. The findings from the
pilot study allow the investigator to determine if the protocol,
which may look very good on paper, actually works in the real
world. The researchers may find that accessing the participants in
the manner they had designed may not be viable. The numbers of
study participants planned to enroll may not actually be available,
or the protocol may be so complex or time consuming that
individuals see it as too difficult and decline to participate. If the
study involves asking participants to complete a questionnaire(s),
the questionnaire(s) should be piloted on a small sample of
individuals other than the study participants. For example, if one
plans to send a dietary questionnaire to 150 males who are 75 years
of age with colon cancer, one could test the questionnaire on 15 men
in the same age group without the condition. This will allow one
to determine if specific questions can be answered, or answered in
the manner stated by the protocol. This small time-saving maneuver
may lessen the workload on the research team, who may otherwise
have to re-contact the study participants to fill in the
unanswered questions, or it may prevent finding at study conclusion
that vital information is incomplete. Input from participants at the
pilot phase will allow the researchers to evaluate and
redesign certain pieces of the study, thereby saving time and energy
before initiating the main study.

6.8 Public Relations

Once the project is ready to start it is important to introduce it to
the eligible population. It may be helpful to have different
component leaders prepare presentations to various community groups
to let them know about the study and how they might help. Oftentimes,
local newspapers will publish a story on research that is being
conducted within the community, or the university information officer
might present it in their next bulletin. Other avenues that may help
disseminate the study could be the local TV channel, or community
groups affiliated with the disease of interest. Getting information
regarding the project into the community will help educate
potential participants even before direct contact has started.

6.9 Networking

If the study involves enrolling individuals for long term follow up,
it is extremely important to maintain contact with them. It may be
worthwhile to create a newsletter that could be forwarded to each
participant every 6 months. These newsletters could highlight a
different member of the team and their involvement, or explain the
rationale for a particular blood test. As the project evolves and data
is analyzed, results should be provided to participants, as they
deserve this and it creates a sense of value and pride.

If the study is a one-time project, a newsletter could be sent at
the end of the study as it will demonstrate genuine appreciation for
their participation.

6.10 Biospecimen Collection

In large multisite studies, the protocol is written to include standard
operating procedures for collection, coding, shipment, and storage
of biospecimens to a central lab for analysis. This ensures that
analysis of samples is standardized and not impacted by variations in
analytical protocol between labs. Data from the central lab is
transferred to a centralized database for analysis as defined in the
protocol.

6.11 Data Collection and Management

Each team or component leader, or a designate, is responsible for
orientation and training of the team in the relevant sections of the
protocol. This ensures consistent application and interpretation
of the protocol and ensures standardized collection of data.

Ongoing supervision of staff and oversight of data by the
investigator or site coordinator is essential to ensure accuracy of
the data. Data is best managed using a secure centralized data
platform that allows the maintenance and linkage of large
collaborative data sets using ISO data standards. Centralization of
the data sets in this manner facilitates quality control of the data,
as well as the rapid dissemination of preliminary results. The data
set should be maintained and analyzed by those individuals on the
team specifically trained to handle the data. Tracking of key
variables and interim data analysis also enables early identification
of problems [16, 17]. However, interim analysis of study outcomes
must be preplanned and outlined in the protocol.

6.12 Achievement of Milestones

Achievement of milestones within the timeframe specified in the
original proposal may be taken into consideration by the funding
agencies during interim review. Feasibility of the study may be
questioned if it is impossible to recruit the target number of
patients into the study across sites within the given time frame.
Therefore, the importance of timelines and their linkage to
achievement of milestones must be clearly communicated to the sites
so that they can identify problems at an early phase.

6.13 Reporting

During the course of the study both financial and scientific progress
reports are provided to the funding agencies at least on an annual
basis. In addition, these reports should be provided to the team as
they are useful in identifying and developing strategies for resolving
problems that crop up from time to time. In addition to the
generation of formal reports, it is important for the full team to
meet at least once a year in person together with the stakeholders,
including funders, to review progress. While it is cost prohibitive
for dispersed inter-disciplinary teams to meet regularly in person,
it is productive to meet using a combination of tele- and web
conferencing where results can be presented and discussed with the
team.

It is important at the outset of a study to establish authorship
guidelines through the study steering committee, and these are
often aligned with funding agency and journal guidelines. All
abstracts, presentations, and manuscripts should be circulated to the
steering committee for review and comment before submission.

Dissemination of study outcomes through peer reviewed
publication is essential for the ongoing success of the team and
greatly enhances opportunities for further funding. A number of
published guidelines, widely endorsed by the scientific community,
are available to help standardize reporting on a number of study
designs. Many high impact journals require that manuscripts
submitted for publication follow these guidelines. These guidelines
include the Consolidated Standards of Reporting Trials
(CONSORT) Statement for reporting randomized trials [18]. The
CONSORT statement has the endorsement of over 600 journals.
The Strengthening the Reporting of Observational Studies in
Epidemiology (STROBE) Statement provides guidelines for
reporting cohort, case-control, and cross-sectional designs [19].
The Strengthening the Reporting of Genetic Association Studies
(STREGA) statement [20] builds on the STROBE statement and
addresses population stratification, genotyping errors,
Hardy-Weinberg equilibrium, and treatment effects in quantitative
traits. Reviewers and editors use these guidelines to assess
strengths and weaknesses of the manuscripts submitted, and the
guidelines help to ensure complete reporting on the part of the
authors.

Evaluation of Diagnostic Tests

John M. Fardy and Brendan J. Barrett 

Abstract 

As technology advances, diagnostic tests continue to improve, and each year we are presented with new 
alternatives to standard procedures. Given the plethora of diagnostic alternatives, diagnostic tests must be 
evaluated to determine their place in the diagnostic armamentarium. The first step involves determining 
the accuracy of the test, including the sensitivity and specificity, positive and negative predictive values, 
likelihood ratios for positive and negative tests, and receiver operating characteristic (ROC) curves. The 
role of the test in a diagnostic pathway has then to be determined, following which the effect on patient 
outcome should be examined. 

Key words Diagnostic tests, Sensitivity, Specificity, Positive predictive value, Negative predictive 
value, Likelihood ratio, Receiver operating characteristic curve 


1 Introduction 


Diagnostic tests are used to increase the likelihood of the presence
or absence of illness, to provide prognostic information and, in
some situations, to predict a response to treatment. The ability of
a diagnostic test to identify a potential underlying disorder depends
not only on the characteristics of the test itself, but also on the
particular situation in which it is used. The prevalence of the
disease in the population and the spectrum of the disease being sought
may influence the way a diagnostic test performs. In this chapter
the characteristics of diagnostic tests will be examined together
with how these characteristics can be used to choose the most
useful diagnostic tests.

In order to determine the accuracy of a diagnostic test, an
arbiter is necessary to decide whether the test result is correct or
not. This is known as the “gold standard” or reference standard. In
some instances the “gold standard” is an established test or
combination of tests which confirms the diagnosis, while in other
cases the “gold standard” requires follow-up over time to confirm or
refute the diagnosis. When considering the characteristics of a
diagnostic test, one must consider the “gold standard” to which it
is compared and determine whether or not it is an appropriate one.
The comparison to the “gold standard” should be carried out in a
blinded fashion so as to prevent bias in the interpretation of the
diagnostic test or the reference standard. Further issues in the
design of diagnostic accuracy studies are discussed in a later
section.

In the past, assessment of diagnostic tests might have been
limited to studies of accuracy, but it is now well recognized that
tests form part of a diagnostic pathway. Test results are used to
alter the probability of diagnoses in the context of what is already
known about the case and the results of other tests that might have
been completed at the same time. There is recognition that the
results of groups of tests may not be independent. As such, the
specific contribution of a particular test needs to be determined.
This has recently been discussed by Moons and colleagues, who show
that the information gain from adding a test can be quantified in
terms of an increase in the area under the ROC curve (see below), net
reclassification improvement, or by decision curve analysis [1].
Furthermore, following studies of the clinical validity of a test, the
clinical utility of the test then needs to be established [2]. There
has been much recent literature on the best approach to assessing
the clinical utility of tests. Such evaluations may include learning
about the full range of effects of tests on patients: psychological,
behavioral, and social effects together with the impact of
subsequent therapies on longer term health outcomes [3].


2 Diagnostic Test Accuracy Criteria 

2.1 Sensitivity and Specificity

The classic parameters used to characterize a diagnostic test are the
sensitivity and specificity of the test. The sensitivity of a test refers
to its ability to identify persons with the disease. It can be defined
as “the proportion of people who truly have a designated disorder
who are so identified by the test” [4]. A very sensitive test is one
which identifies most people with the disorder in question. A test
which is very sensitive is prone to false positive results, i.e., it may
incorrectly label people as having the disease when, in fact, they do
not have it.

The specificity of a test, on the other hand, refers to its ability
to correctly identify persons who do not have the disease in
question. It can be defined as “the proportion of people who are
truly free of a designated disorder who are so identified by the
test” [4]. A very specific test would be unlikely to incorrectly
label an individual as having the disorder in question if, in fact,
they do not have the disorder. However, a test which is very
specific is more prone to false negative results, i.e., it may fail
to identify the disease in some persons who actually have it.

Table 1
Assessment of diagnostic tests using 2 x 2 contingency table

                       Gold standard
                       Positive             Negative             Total
New test   Positive    True positive (a)    False positive (b)   a + b
           Negative    False negative (c)   True negative (d)    c + d
Total                  a + c                b + d

Sensitivity: a/(a + c), Positive predictive value: a/(a + b)
Specificity: d/(b + d), Negative predictive value: d/(c + d)

There is always a trade-off between sensitivity and specificity;
as one increases, the other tends to decrease [5]. The higher the
cut-off used to say a test is positive, the more specific the test
becomes, but this higher specificity comes at a price. As the cut-off
is increased, the sensitivity decreases and the test is more likely to
miss affected individuals. In some situations, such as screening for
a disease, a lower cut-off might be used to create a very sensitive
test so as not to miss anyone with the disorder in question. In other
situations, when using a test to confirm a diagnosis, a higher
cut-off making the test highly specific would be more desirable so as
not to incorrectly label anyone with the disorder.

The sensitivity and specificity of a diagnostic test can be
calculated using information obtained by comparing the performance of
a diagnostic test to a gold standard or reference standard. Typically
these results are summarized in a 2 x 2 contingency table as shown
in Table 1. Such tables can of course be extended to illustrate the
distribution of data at different test cut-offs. Sensitivity and
specificity are not directly influenced by disease prevalence, but are
affected by the disease severity spectrum. A test that is sensitive for
detection of advanced disease may be less sensitive for detection of
earlier stages. An example would be the chest X-ray for detection
of lung cancer.

2.2 Positive and Negative Predictive Values

The sensitivity and specificity of a diagnostic test are useful to
describe how well a test performs, but they do not give us much
information on the significance of a positive or negative test for an
individual patient. This information can be obtained from the
positive and negative predictive values of the test. The positive
predictive value describes “the proportion of people with a
positive test who have the disease” [5]. Similarly, the negative
predictive value describes “the proportion of people with a negative
test who are free of disease” [5]. These ratios are calculated across
the table rather than down the table using the formulae in Table 1.
These parameters are more useful to the clinician and the patient
as they give the predictive value of a positive and a negative test.
A test with a high positive predictive value makes the disease quite
likely in a subject with a positive test. A test with a high negative
predictive value makes the disease quite unlikely in a subject with
a negative test.
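
The four accuracy measures defined so far can be read directly off
the cells of a 2 x 2 table. The following is a minimal sketch of that
arithmetic, not a validated tool: the class name TwoByTwo and the
counts used are illustrative assumptions, and only the cell labels
a, b, c, d and the formulae come from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a 2 x 2 table, labelled as in Table 1."""

    a: int  # true positives: test positive, gold standard positive
    b: int  # false positives: test positive, gold standard negative
    c: int  # false negatives: test negative, gold standard positive
    d: int  # true negatives: test negative, gold standard negative

    def sensitivity(self) -> float:
        # Calculated down the gold-standard-positive column.
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        # Calculated down the gold-standard-negative column.
        return self.d / (self.b + self.d)

    def positive_predictive_value(self) -> float:
        # Calculated across the test-positive row.
        return self.a / (self.a + self.b)

    def negative_predictive_value(self) -> float:
        # Calculated across the test-negative row.
        return self.d / (self.c + self.d)


# Illustrative counts only; these are not data from a real study.
table = TwoByTwo(a=45, b=10, c=5, d=40)
print(f"Sensitivity: {table.sensitivity():.3f}")
print(f"Specificity: {table.specificity():.3f}")
print(f"PPV:         {table.positive_predictive_value():.3f}")
print(f"NPV:         {table.negative_predictive_value():.3f}")
```

Sensitivity and specificity use column totals, whereas the predictive
values use row totals, which is why only the latter shift when the
mix of diseased and non-diseased subjects changes.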

Although the positive and negative predictive values of a test
are intuitively more useful to the clinician and patient, the
predictive values are less stable and are dependent on the prevalence
of disease. This makes them less portable from population to
population. It also means that positive and negative predictive values
derived from a study may not apply to any given patient if that
patient’s pre-test probability of disease differs from the prevalence
of the disease in the study sample.

2.3 Case Study

Let’s take a hypothetical new test used to rapidly detect an
infectious process usually diagnosed by a culture technique which may
take up to a month to provide a result (this is the case for several
newer tests for tuberculosis). In a cohort of affected and
unaffected subjects in which the prevalence of disease is 50 %, how
does the new test compare to the culture technique? The results in
Table 2 show a new test with excellent sensitivity and good
specificity. This test would be a good screening test and a reasonable
confirmatory test. The positive predictive value of 82 % and the
negative predictive value of 89 % suggest the new test is quite
beneficial to patients and doctors.

Table 2
Assessment of a new diagnostic test when prevalence of disease is 50 %

                       Gold standard
                       Positive    Negative    Total
New test   Positive    45 (a)      10 (b)      55
           Negative    5 (c)       40 (d)      45
Total                  50          50          100

Prevalence of disease = 50/100 = 50 %
Sensitivity = a/(a + c) = 45/50 = 90 %
Specificity = d/(b + d) = 40/50 = 80 %
Positive predictive value = a/(a + b) = 45/55 = 82 %
Negative predictive value = d/(c + d) = 40/45 = 89 %

However, if the prevalence of the disease is 10 % instead of
50 %, and the sensitivity and specificity are the same, the positive
and negative predictive values change as shown in Table 3.
Although the negative predictive value has increased from 89 to
99 %, the positive predictive value has dropped to 33 %. This test,
which was initially a very good predictor of disease when
prevalence was 50 %, has much poorer positive predictive value when
the disease prevalence drops to 10 %. In fact, with lower disease
prevalence, the test produces twice as many false positives as true
positives. In general, diagnostic tests will function most efficiently
when the prevalence (or pre-test probability) is between 40 and
60 % and provide much less information at the extremes of pre-test
probability [5].

Table 3
Assessment of a new diagnostic test when prevalence of disease is 10 %

                       Gold standard
                       Positive    Negative    Total
New test   Positive    45 (a)      90 (b)      135
           Negative    5 (c)       360 (d)     365
Total                  50          450         500

Prevalence of disease = 50/500 = 10 %
Sensitivity = a/(a + c) = 45/50 = 90 %
Specificity = d/(b + d) = 360/450 = 80 %
Positive predictive value = a/(a + b) = 45/135 = 33 %
Negative predictive value = d/(c + d) = 360/365 = 99 %
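
The dependence of the predictive values on prevalence can be
reproduced without building a table for each scenario. The sketch
below is illustrative only: the function name predictive_values is an
assumption, and the 1 % prevalence row is an added hypothetical
scenario that does not appear in the case study. It holds sensitivity
at 90 % and specificity at 80 % and recomputes the predictive values
from the expected proportion of subjects in each cell.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence.

    Each term is the expected proportion of all subjects falling in one
    cell of the 2 x 2 table.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# 50 % and 10 % are the case-study prevalences; 1 % is a hypothetical
# extra scenario added here to show the trend at very low prevalence.
for prevalence in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(0.90, 0.80, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 50 % and 10 % prevalence this gives the same predictive values as
the counts in Tables 2 and 3, since it applies the same formulae to
proportions rather than to whole numbers of subjects.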

2.4 Likelihood Ratios

The ideal test parameter would be one which has predictive value
and is stable with changes in prevalence. The likelihood ratio is
such a parameter. A likelihood ratio expresses the relative odds that
a given level of a diagnostic test result would be expected in a
patient with (as opposed to one without) the target disorder [4].
As with the other parameters, likelihood ratios are calculated from
the 2 x 2 table.

Likelihood ratio for a positive test:

LR+ = [a/(a + c)] / [b/(b + d)] = sensitivity / (1 - specificity)

Likelihood ratio for a negative test:

LR- = [c/(a + c)] / [d/(b + d)] = (1 - sensitivity) / specificity

Because the likelihood ratios are calculated from the sensitivity
and specificity, they are also stable with changes in prevalence of
disease. The likelihood ratio is used to calculate the post-test odds
of disease from the pre-test odds of disease using the following
formula (for a negative test, LR- is used in place of LR+):

Post-test odds = Pre-test odds x LR+

The pre-test odds of disease are derived from the pre-test
probability of disease and can be calculated with the following
formula:

Pre-test odds = Pre-test probability / (1 - Pre-test probability)

The pre-test probability of disease is usually estimated from the
clinical information or from published reports.
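
The chain from pre-test probability to post-test probability can be
sketched as follows. This is a hedged illustration rather than a
clinical calculator: the function names are assumptions, the 90 %
sensitivity and 80 % specificity are taken from the case study, and
the final step (probability = odds / (1 + odds)) is the standard
inverse of the odds formula above, which the text does not state
explicitly.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability."""
    # Probability -> odds.
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    # Apply the likelihood ratio for the observed test result.
    post_test_odds = pre_test_odds * likelihood_ratio
    # Odds -> probability.
    return post_test_odds / (1 + post_test_odds)


lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")

# A patient with a 10 % pre-test probability of disease.
print(f"After a positive test: {post_test_probability(0.10, lr_pos):.3f}")
print(f"After a negative test: {post_test_probability(0.10, lr_neg):.3f}")
```

With these inputs LR+ is 4.5 and LR- is 0.125. For a pre-test
probability of 10 %, the post-test probability after a positive
result is about 33 %, which agrees with the positive predictive value
in Table 3, where prevalence is also 10 %.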

A diagnostic test with likelihood ratios near unity does not
have much effect on the post-test probability of disease and
therefore is not very useful for decision making. On the other hand,
very large LR+ or very small LR- values have a significant
impact on the post-test probability of disease. An LR for a positive
test of 10 or more means that a positive test is good at ruling in a
diagnosis, while an LR for a negative test of 0.1 or less means that
a negative test is good at ruling out a diagnosis [6]. Likelihood
ratios between 5 and 10 if test positive or 0.1-0.2 if test negative
lead to moderate changes in the post-test probability, while those
between 2 and 5 (0.2-0.5) lead to smaller changes.

The use of likelihood ratios to characterize diagnostic tests
highlights the importance of the pre-test probability of disease in
the performance of a diagnostic test. If the pre-test probability of
disease is very high or very low, a diagnostic test will have to be
very good to make a significant difference in the post-test
probability of disease. Diagnostic tests will perform best when the
pre-test probability of disease is about 50 % and generally will
perform less well at the extremes of pre-test probability [5]. If the
pre-test probability of disease is so high or so low as to rule in or
rule out a diagnosis, a diagnostic test is not warranted [6].

2.5 Overall Test Accuracy

These various parameters used to characterize diagnostic tests can 
help in choosing one test over another, but they do not provide a 
summary estimate of the accuracy of the test. The receiver operating
characteristic (ROC) curve can be used for this purpose. An
ROC curve is a plot of test sensitivity (plotted on the y axis) versus
its false positive rate (1 - specificity) (plotted on the x axis) [7]. As
the cut-off value for a positive test is moved up or down, the sensitivity
and specificity of the test change. Figure 1 is an example of
an ROC curve for a hypothetical diagnostic test. In this example,
raising the cut-off value would lead to high specificity and low sensitivity,
with coordinates toward the lower left hand corner of the
curve. Lowering the cut-off value for a positive test would lead to
a progressive increase in sensitivity and a progressive decrease in
specificity moving up along the curve toward the upper right hand
corner. The point on the curve closest to the upper left hand corner
(which represents 100 % sensitivity and 100 % specificity)
would represent the cut-off value which offers the best balance 
between sensitivity and specificity. This may not always be the best 
cut-off to choose, depending on the purpose of the test. For a 
screening test, sensitivity would be favored over specificity, while 
for a confirmatory test specificity would be favored over sensitivity. 
In general one needs to consider the clinical impact of false positive 
and false negative test results and weigh these against each other to 
determine the most useful cutoff for any given context. 
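
The construction of an ROC curve and the "closest to the upper left hand corner" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test values are invented, higher values are assumed to indicate disease (a result is positive when it is at or above the cut-off), and the function names are hypothetical.

```python
from math import hypot


def roc_points(diseased, nondiseased):
    """Return (cutoff, sensitivity, false_positive_rate) for every candidate cutoff.

    Assumption: a result counts as positive when value >= cutoff.
    """
    cutoffs = sorted(set(diseased) | set(nondiseased))
    points = []
    for cutoff in cutoffs:
        sensitivity = sum(v >= cutoff for v in diseased) / len(diseased)
        false_positive_rate = sum(v >= cutoff for v in nondiseased) / len(nondiseased)
        points.append((cutoff, sensitivity, false_positive_rate))
    return points


def closest_to_upper_left(points):
    """Pick the point with the smallest distance to (FPR = 0, sensitivity = 1)."""
    return min(points, key=lambda p: hypot(p[2], 1.0 - p[1]))


# Invented test values for subjects with and without the disease.
diseased = [6.1, 7.4, 5.2, 8.0, 6.8, 4.9, 7.7, 5.9]
nondiseased = [3.2, 4.1, 5.0, 2.8, 4.6, 5.5, 3.9, 4.4]

points = roc_points(diseased, nondiseased)
best = closest_to_upper_left(points)
print("cutoff, sensitivity, 1 - specificity:", best)
```

As the text notes, this geometric criterion weights false positives and false negatives equally, so a screening or confirmatory test may call for a different cut-off than the one selected here.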

Fig. 1 Receiver operating characteristic (ROC) curve for assessing diagnostic tests (x axis: 1 - specificity)

The ROC curve also provides information on the overall accuracy
of the diagnostic test. The area under the ROC curve (the area
to the right of the curved line in Fig. 1) is a popular measure of the
accuracy of a diagnostic test [7]. The ROC curve area can take on 
values between 0.0 and 1.0, with an area of 1.0 representing a perfectly
accurate test. A test with an area of 0.0 is perfectly inaccurate;
all patients with the disease have negative results, while all
those without the disease have positive results. Such a test would
have perfect accuracy if the interpretation of the test were reversed.
Therefore, the practical lower bound for the area under the ROC
curve is 0.5, which is bounded by the straight line from coordinates
(0,0) to (1,1). This line is known as the chance diagonal on an
ROC plot [7]. The area under the ROC curve can be used to
compare the accuracy of diagnostic tests. It should be noted that in
a given study the area under the curve is an estimate with an associated
standard error. This can be used to calculate confidence intervals
around the estimated area and is also used when the areas
under the ROC curves associated with different tests are being
compared. Both parametric and nonparametric statistical procedures
exist to compare areas under ROC curves, including adjustments
for paired samples if the two tests being compared were
completed within the same subjects [8,9]. 
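
One nonparametric way to estimate the area under the ROC curve is to use its equivalence with the Mann-Whitney statistic: the area equals the probability that a randomly chosen diseased subject has a higher test value than a randomly chosen nondiseased subject, with ties counted as one half. The sketch below assumes that higher values indicate disease; the data and the function name are invented for illustration, and no standard error is computed.

```python
def auc_mann_whitney(diseased, nondiseased):
    """Estimate the ROC area as the proportion of diseased/nondiseased pairs
    in which the diseased subject has the higher value (ties count 0.5)."""
    wins = 0.0
    for d in diseased:
        for n in nondiseased:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))


# Invented test values for illustration only.
diseased = [6.1, 7.4, 5.2, 8.0, 6.8, 4.9, 7.7, 5.9]
nondiseased = [3.2, 4.1, 5.0, 2.8, 4.6, 5.5, 3.9, 4.4]

print("estimated AUC:", auc_mann_whitney(diseased, nondiseased))
print("perfect test:", auc_mann_whitney([3, 4], [1, 2]))
print("perfectly inaccurate test:", auc_mann_whitney([1, 2], [3, 4]))
```

The last two calls reproduce the limiting cases described above: complete separation gives an area of 1.0, and a test whose interpretation should be reversed gives 0.0.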

If the concern is the accuracy of a test, the percentage of 
patients correctly classified by the test under evaluation can be 
assessed. In Table 1, accuracy can be calculated as follows: 


Accuracy = (a + d) / (a + b + c + d)

Unfortunately, the overall accuracy is highly dependent on the
prevalence of the disease. Another option for a single indicator of 
test performance is the diagnostic odds ratio (DOR). This is the 
ratio of the odds of positivity in the diseased relative to the odds of 
positivity in the nondiseased [10]. Like the odds ratio in any 2x2 
table it is calculated using the following formula: 

DOR = ad / bc

There is also a close relationship between the DOR and the likelihood
ratios:

DOR = LR+ / LR- [10]

The value of the DOR ranges from 0 to infinity with higher values
associated with better performance of a diagnostic test. A value of
1 means that a test does not discriminate between those
with and without the target disorder, while values lower than 1
suggest improper interpretation of the diagnostic test (more negative
tests among the diseased). As with likelihood ratios, the DOR
is not dependent on the prevalence of disease, but like sensitivity 
and specificity is influenced by the disease spectrum in the study 
population [10]. The DOR can also be useful in meta-analysis of 
diagnostic studies. 
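
The relationships among these indices can be checked numerically. The sketch below assumes the conventional 2x2 layout, which is consistent with the formulas above but is an assumption here because Table 1 is not reproduced in this section: a = true positives, b = false positives, c = false negatives, d = true negatives. The counts and the function name are invented for illustration.

```python
import math


def two_by_two_indices(a, b, c, d):
    """Summary indices from a 2x2 table.

    Assumed layout: a = true positives, b = false positives,
    c = false negatives, d = true negatives. All cells must be non-zero,
    otherwise the ratios below are undefined.
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "dor": (a * d) / (b * c),
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
    }


# Same sensitivity (0.90) and specificity (0.80) at two prevalences.
high_prevalence = two_by_two_indices(a=90, b=20, c=10, d=80)    # prevalence 50 %
low_prevalence = two_by_two_indices(a=90, b=180, c=10, d=720)   # prevalence 10 %

for label, result in (("50 %", high_prevalence), ("10 %", low_prevalence)):
    ratio = result["lr_pos"] / result["lr_neg"]
    print(label, "accuracy", round(result["accuracy"], 2),
          "DOR", round(result["dor"], 1), "LR+/LR-", round(ratio, 1))
    assert math.isclose(ratio, result["dor"])
```

In this example the accuracy falls from 0.85 to 0.81 as the prevalence drops, while the DOR stays at 36 in both tables and equals LR+ / LR-.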

In all of the previous discussion it has been assumed that the 
reference or “gold” standard will yield a binary outcome of disease 
presence or absence. However this is not always the case, as for 
example when echocardiographically determined left ventricular 
mass as a continuous measure serves as the reference standard 
when evaluating features of the ECG as a diagnostic test. In that 
case, a different statistical approach has been proposed for estimating
sensitivity, specificity and the ROC curve [11]. An alternate
approach using information theoretical concepts also permits consideration
of quantitative reference results while explicitly taking
into account variation in pre-test probabilities [12]. 

In addition the reference standard itself may not always be perfect,
and in that situation the use of Bayesian Latent Class Models
can allow evaluation of novel tests [13-15]. Recently a Web-based
application has been developed to allow the less statistically accomplished
researcher to complete the required analyses via a user-friendly
interface [16].


3 Design of Diagnostic Accuracy Studies 

Given the various tools available, how would one set out to evaluate
a new diagnostic test? The criteria have been discussed in standard
textbooks of clinical epidemiology and are outlined below
[4, 5]. These criteria center around a blinded evaluation of the new
test versus a “gold standard” in an appropriate population. The
reproducibility and the interpretation of the test should be standardized
and the test procedure should be well described. Finally,
the clinical utility should be documented. 

The importance of a blinded evaluation of the diagnostic test 
versus the reference standard is paramount in the evaluation of a 
new diagnostic test. Knowledge of the results of either the diagnostic
test or the reference standard could lead to bias when interpreting
the results of the other. Lack of a blinded comparison
would invalidate the results of the study. 

The population chosen for study is also a critical factor in the
assessment of a diagnostic test. Test performance will vary with disease
prevalence and with disease severity, such that diagnostic test
performance often varies across population subgroups [17]. The
sample population chosen for evaluation of the diagnostic test should
be similar to the population for which the test is intended, in terms
of both the prevalence and severity of the disease. The comparison
group should be drawn from that same population, that is, from those
suspected of having the target disorder who do not actually have the
disease, as opposed to “normal” individuals. In essence, the test
should be evaluated under the same conditions in which it will be used.
Assessing test accuracy in samples selected to include cases with
obvious or severe disease as well as healthy controls will tend to overestimate
the accuracy of the test under routine conditions.

In studies of the accuracy of diagnostic tests, it is important
that all members of the sample population undergo both the test
being assessed and the “gold standard”. In a systematic review
of the sources of bias and variation in diagnostic test accuracy studies,
Whiting and colleagues found that use of a case-control design,
observer variability, availability of clinical information, choice of
reference standard, disease prevalence and severity, and verification
biases were the major sources, with generally greater impact
on the estimate of sensitivity than specificity [18]. Methods for
determining sample size for studies of the accuracy of diagnostic 
tests are tailored to the particular indices which are being studied. 
Sample size estimates can be calculated for several accuracy indices 
including sensitivity and specificity, the area under the receiver 
operating characteristic curve, the sensitivity at a fixed false positive 
rate, and the likelihood ratio [19]. 
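
As an illustration of what such a calculation can look like for sensitivity, the sketch below uses a common normal-approximation precision formula: the number of diseased subjects needed so that the confidence interval around an anticipated sensitivity has a chosen half-width, inflated by the expected prevalence to give the total sample. This is shown only as one simple approach under stated assumptions and is not necessarily the method described in reference [19]. The function names and input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def cases_needed(expected_sensitivity, half_width, confidence=0.95):
    """Diseased subjects needed for a confidence interval of +/- half_width
    around the expected sensitivity (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variance = expected_sensitivity * (1 - expected_sensitivity)
    return ceil(z ** 2 * variance / half_width ** 2)


def total_sample(cases, prevalence):
    """Total subjects to recruit so that the expected number of diseased
    subjects reaches `cases`, given the prevalence in the study population."""
    return ceil(cases / prevalence)


# Hypothetical planning values: sensitivity 0.90, half-width 0.05, prevalence 25 %.
n_cases = cases_needed(expected_sensitivity=0.90, half_width=0.05)
print("diseased subjects needed:", n_cases)
print("total subjects needed:", total_sample(n_cases, prevalence=0.25))
```

The same form of calculation applies to specificity, using the nondiseased subjects and dividing by (1 - prevalence) instead.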

The reproducibility of the test should also be evaluated,
particularly when it involves a subjective interpretation of the
results. Both the inter-observer and intra-observer variation should 
be examined and evaluated with an appropriate measure, such as a 
kappa statistic, which reveals the degree of agreement between test 
readers. The test procedure should be well described so that it can 
be replicated by others. As well, there may be a significant learning 
curve associated with the interpretation of a new diagnostic test 
and this must be taken into account as the test is evaluated. 
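
The kappa statistic mentioned above corrects the observed agreement between two readers for the agreement expected by chance. The following minimal sketch computes Cohen's kappa for two readers who classify the same set of tests. The readings are invented and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers rating the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(reader_a)
    observed = sum(x == y for x, y in zip(reader_a, reader_b)) / n
    freq_a = Counter(reader_a)
    freq_b = Counter(reader_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented readings of 50 tests: the readers agree on 35 of them.
reader_a = ["pos"] * 25 + ["neg"] * 25
reader_b = ["pos"] * 20 + ["neg"] * 5 + ["pos"] * 10 + ["neg"] * 15

print("kappa:", round(cohens_kappa(reader_a, reader_b), 2))
```

Here the readers agree on 70 % of the tests, but half of that agreement would be expected by chance alone, so kappa is 0.4.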

Given the plethora of studies that may exist evaluating the
accuracy of a given diagnostic test, there has been interest in completion
of diagnostic test accuracy systematic reviews and meta-analyses.
The challenges involved have been addressed by a
Cochrane Methods group [20]. Tools were developed to assess the 
quality of the constituent studies [21, 22]. Challenges are often 
posed by heterogeneity in the design, setting, and results of the 
various primary studies. Care needs to be taken when formulating 
the questions for the systematic review. 


4 Factors Relevant to the Choice of Diagnostic Tests 

The choice of diagnostic tests is certainly influenced by test performance,
but this is not the only important factor to be considered.
Although a Ferrari may outperform the competition, its cost 
and seating capacity may make it unsuitable for the job at hand. 
In choosing a diagnostic test one must consider, in addition to test 
performance, the cost, availability, acceptability, and utility of the 
diagnostic test. A practical hierarchy can be defined based on (1) 
diagnostic power or performance, (2) availability and acceptability 
where considered relevant, and (3) cost [23]. 

Cost and availability are obvious concerns when one considers 
the choice of diagnostic tests. A very expensive test with limited 
availability would have to outperform standard tests by a wide margin
before it could be considered for routine use. The acceptability
of the diagnostic test is also a major concern, particularly for the 
patient. An invasive test with potentially serious complications will 
not be accepted readily by patients, particularly if there is a safer, 
noninvasive alternative. One must also consider that information 
produced in research about diagnostic tests is utilized by several different
types of decision makers who are interested in different types
of information [24]. Policy-making organizations will be more concerned
with the “evidence-based” assessment and cost of testing,
while patients may place more emphasis on anecdotal experience 
and the reassurance value of testing. Physicians will typically find 
themselves acting as representatives of the medical profession and 
its body of knowledge, and as advocates for each patient [24]. 

The final arbiter in the choice of diagnostic tests is the clinical 
utility of the test under scrutiny. Studies of diagnostic test accuracy 
may, on their own, provide sufficient information to infer clinical 
value if a new diagnostic test is safer or more specific than the old 
test, provided both are of similar sensitivity and that treatment 
based on results of the old test has been shown to improve patient 
outcomes in clinical trials [25]. Establishing whether a new test
improves patient outcomes beyond the outcomes achieved using
an older test, or no test at all prior to treatment, may require the
completion of randomized trials. A randomized trial can assess the
outcomes of patients undergoing testing, document adverse
effects, assess the impact on management decision making, and
measure patient satisfaction and the cost-effectiveness of testing
[26]. A variety of randomized designs have been proposed with
the choice among them depending upon the objective of testing
and whether alternative test/treat strategies are to be compared
[27]. A framework for evaluating the links and mechanisms
whereby outcomes are impacted in diagnostic test/treat trials has
been proposed [28]. Concerns have been raised about the efficiency
of some designs proposed for test/treat trials. It has been
suggested that in the case where two tests are being compared in 
terms of clinical utility, a paired design in which each participant
undergoes both tests, with subsequent treatment randomly
assigned only when the test results are discordant, may be more efficient
[29]. Sample size formulae for binary and continuous outcomes 
have also been proposed by the same authors [29]. Ethical issues 
that arise in relation to these trials include the need for equipoise, 
not so much with regard to the relative accuracy of tests, but rather 
with regard to the comparative health impact of alternative test/ 
treat strategies. In addition if a clustered design is followed, there 
is a need for those who decline participation to be aware that the 
whole diagnostic process in a particular clinic or hospital, for example,
may be influenced by the assignment of that site to a novel
test/treat strategy for the trial [30]. 

As commonly done in economic analyses, decision models can 
also be used to compare various test/treat strategies, but the results 
depend critically on the accuracy of the assumptions and estimates 
used to build and inform the models [3]. 


5 Conclusion 


Diagnostic test performance can be measured using a number of 
different instruments which assess the accuracy and predictive 
value of the tests. The choice of diagnostic tests, however, is more 
complex than a simple assessment of performance, and consideration
of broader issues such as patient outcomes, acceptability, and
cost-effectiveness of testing is necessary. By using the appropriate 
criteria to assess diagnostic test performance, followed by 
randomized trials to measure clinical utility, the choice of the best 
diagnostic test to solve a diagnostic problem can be made.


Chapter 18


Qualitative Research in Clinical Epidemiology 

Deborah M. Gregory and Christine Y. Way 

Abstract 

This chapter has been written to specifically address the usefulness of qualitative research for the practice of 
clinical epidemiology. The methods of grounded theory to facilitate understanding of human behavior and 
construction of monitoring scales for use in quantitative studies are discussed. In end-stage renal disease 
patients receiving long-term hemodialysis, a qualitative study used grounded theory to generate a multilayered
classification system, which culminated in a substantive theory on living with end-stage renal disease and
hemodialysis. The qualitative data base was re-visited for the purpose of scale development and led to the 
Patient Perception of Hemodialysis Scale (PPHS). The quantitative study confirmed that the PPHS was 
psychometrically valid and reliable and supported the major premises of the substantive theory. 

Key words Clinical epidemiology, Grounded theory, Instrument development, Qualitative research 


1 Using Qualitative Research Methods in Clinical Epidemiology 

Over the past decade, the discipline of clinical epidemiology 
focused on evidence-based medicine and evidence-based health 
policy themes. Greater interest in qualitative research methods 
accompanied this trend [1] and can be partially attributed to the
increased recognition given to the role of psychosocial factors in shaping
health outcomes. Focusing on physiological manifestations of
disease to the exclusion of the total illness experience—behavioral, 
social, psychological, and emotional—is a rather limited view of 
what it means to live with a chronic illness. The primary objective 
of qualitative inquiry is to reconstruct the richness and diversity of 
individuals’ experiences in a manner that maintains its integrity 
(i.e., the truth value). As such, qualitative findings may be used to 
identify clinical areas requiring consideration and, in turn, facilitate 
the development of appropriate and timely interventions for modifying
or resolving problem areas.

Certain basic assumptions differentiate qualitative from quantitative
modes of inquiry. Although both types of inquiries use a
variety of methodological approaches to generate data about
individuals’ experiences with events and situations, quantitative
studies are concerned with explaining a phenomenon, whereas
qualitative studies are interested in understanding and interpreting
it. As well, objectivity and generalizability are the goals of quantitative
science versus the subjectivity and contextualization goals of
qualitative science.

For researchers committed to the scientific paradigm of knowledge
development, concerns with interpretive methods relate to
the perceived shortcomings of objectivity, reliability, and validity
versus the espoused rigor of the experimental method. A guarded
acceptance of evidence from qualitative studies has been recommended
until better guidelines are devised for data collection and
analysis [2]. However, this is an inaccurate representation because
there are explicit rules for enhancing the rigor and adequacy of
qualitative inquiries [3, 4].

Divergent philosophical traditions guiding qualitative researchers
are one reason why one group may argue for precise quality criteria
to assess rigor [5] and another group may view such restrictions as
limiting access to the full richness and diversity of individuals’
experiences [6, 7]. Most qualitative researchers operate somewhere
between these positions. In a review of validity criteria for qualitative
research, Whittemore, Chase, and Mandle [4] argued that rigor
can be combined with subjectivity and creativity if flexible criteria
exist to support the basic tenets of interpretive research. Relevant
criteria highlighted by the authors included credibility and authenticity
(accurate reflection of experiences and differing or comparable
subtleties within groups), criticality and integrity (critical
appraisal that involves checking and rechecking to ensure data
interpretations are true and valid), auditability or explicitness
(specification of a decision trail in method and interpretation for
other researchers to follow), vividness (capturing the richness of
the data while striving for parsimony, so informed readers can
appreciate the logic), creativity (imaginative but grounded in the
data), thoroughness (sampling and data adequacy to ensure full
development of themes and connectedness among them), congruence
(logical link between question, method, and findings), and sensitivity
(ethical considerations). The authors also highlighted
techniques that could facilitate application of these criteria and lessen
validity threats during study design (method and sampling decisions),
data collection and analysis (clarity and thoroughness, member
checks, and literature reviews, among others), and presentation
of findings (audit trail and rich, insightful descriptions).


2 Grounded Theory Methodology 

Grounded theory methodology was developed by Glaser and 
Strauss [8] and later refined by Glaser [9, 10] and Strauss and 
Corbin [11]. The primary objective of this method is to facilitate
greater understanding of human behavior and interactions within
variant and similar contexts. The inductive-deductive approach to
studying phenomena is focused on generating as opposed to testing
theory. As conceptualized by Glaser and Strauss, substantive
theory is seen as emerging from a substantive area of inquiry. For 
example, a substantive theory could be generated from exploring 
common perceptions shared by individuals comprising distinct 
clinical groups: patients with end-stage renal disease (ESRD) on 
long-term hemodialysis or members of families with the germ-line 
mutation for hereditary nonpolyposis colorectal cancer (HNPCC). 

The strength of grounded theory is that its interest is not
merely in describing how individuals experience a particular phenomenon,
as in some other qualitative approaches. What makes it unique is the
emphasis placed on identifying and describing the social-psychological
processes grounded in the emergent data. That is, the focus
is not solely on how illnesses, diagnostic procedures, or treatment
protocols are experienced but rather on how information about
them is received and assimilated into existing belief structures in a
way that becomes a stimulant for desired behavior and makes it
possible to achieve optimal health functioning.

The key differentiating features of the method warrant consideration.
First, grounded theory involves the simultaneous collection
of data through interviews and its analysis. This concurrent
approach allows the researcher to use the constant comparative
method of analysis to compare and interpret each piece of data
with other pieces within and among interview transcripts until
codes are refined and collapsed, outlier cases considered and
rejected, and the groundwork laid for formulating substantive theory.
Second, theoretical sampling is an important tool for data collection
and analysis. This form of sampling involves the deliberate
selection of participants based on their experience with the area of
interest and the needs of the emerging theory [12]. Third, a limited
review of the relevant literature is completed prior to the
research. To avoid prejudgments, an in-depth review is delayed 
until critical junctures in the analysis to help refine emerging 
constructs and position them, if possible, within existing theory 
(i.e., thematic categories guide the search for relevant studies). 

Glaser and Strauss [8] used category labels to describe groups 
of events or situations with common attributes. Categories are 
composed of properties, with incidents defining descriptors used 
to define properties. Transcripts are analyzed line by line and open 
codes, based on participants own words, inserted in relevant 
margins to help reduce researcher bias. These substantive codes are 
aligned with similar and dissimilar ideas, thoughts, or beliefs. 

In the second stage of analysis, open codes are collapsed, without
altering identified themes, into key properties aligned with emerging
categories. Descriptors (i.e., grouping and collapsing of substantive
codes from incidents in the data), properties, and categories
are constantly reassessed for validity. As the categories approach
saturation (no new insights are produced), theoretical sampling
ceases. With further analysis, it is possible to delineate interrelationships
among the categories, which eventually culminate in a
theory’s propositions. A final step involves reflecting on the data
and emergent categories for a core construct (dominant social-psychological
process) that links things into a meaningful whole
while accounting for most of the variation observed in the data.

Researchers often revisit the thematic categories defining a
substantive theory for the purpose of scale construction. Within
these categories are rich data clusters that enable the generation of 
items for scales that represent specific content or domains. 
Following scale validation through psychometric analysis, useful 
operational measures are available to test the theory’s propositions. 
For example, items for quality-of-care [13] and quality-of-life 
[14, 15] scales were generated from qualitative data bases. 


3 Example of Using Grounded Theory to Generate Substantive Theory 


3.1 Background

Clinicians and researchers have been interested in documenting how
experiences with end-stage renal disease (ESRD) and hemodialysis
influence overall adjustment and the achievement of quality outcomes.
The research evidence suggests that individuals on long-term
maintenance hemodialysis are required to adapt to highly volatile
illness and treatment experiences, a changing support base, and
significant losses and lifestyle restrictions. Emotionally, psychologically,
physically, socially, and spiritually, there is a constant search for
a sense of balance or normalcy. There is also evidence of a constant
struggle to obtain a quality-of-life standard which can provide a
benchmark for evaluating unpredictable events.

3.2 Design

A grounded theory study was designed to gain an understanding
of the meaning and significance of events and situations as defined
by the “dialysis culture” for patients with ESRD. The primary purpose
was to provide evidence for an interactive paradigm that views
patients as free human beings who interact with all aspects of
dialysis care. A secondary purpose was to identify periods of “critical
interactive moments” during hemodialysis and determine their
impact on perceived care quality.

3.3 Sample and Procedure

From a population of 71 patients receiving hemodialysis at the
study site during data collection (April-September 1996), 44 met
the inclusion criteria (minimum of 12 weeks on dialysis, able to
understand the interview process and study purpose and give
informed consent, fluent in the English language, 19 years of age
and over, and not experiencing an acute illness episode or significant
decline in health). Semi-structured interviews of 60-90 min
duration were conducted with all participants. Initial question
content focused on the total illness trajectory (personal reactions
to illness, treatment, and critical events; informal and formal
supports). Many additional questions were generated by the thematic
content emerging during the concurrent data analysis (supportive
role of organizations, institutions, and predialysis clinics;
suggestions to help prepare new patients for dialysis; decision-making
about or availability of transplant and home dialysis options;
conduciveness of physical environment; exposure to varying acuity
levels of fellow patients). A second interview was scheduled 6-8
weeks following the first to confirm interpretive summaries constructed
from each participant’s transcript, clarify identified gaps in
the data, and confirm conceptual categories and properties.

3.4 Data Analysis

Data analysis proceeded in several phases. Taped interviews were
first transcribed verbatim within 48 h and checked for accuracy.
Immersion in the data was facilitated by listening to participants’
interviews while reading the transcripts. At the second step, the
focus was on interpreting the meaning of words and sentences
through reading and rereading each transcript. Integral to this process
was assigning substantive codes to recurrent themes. This served
two purposes: becoming immersed with each narrative to help 
construct interpretive summaries and identifying further probes and 
questions. 

Theoretical sampling indicated that common themes were
emerging after completion of 15 interviews and first-level coding
(i.e., substantive codes). At this point, interviewing was temporarily
stopped and the constant-comparative method of analysis was
applied to the data sets by two independent raters. The objective
was to create a meaning context by forging determinate relationships
between and among codes (i.e., substantive codes highlighting
the major processes present in the data). The result was a
multilayered classification system of major categories and associated 
properties, descriptors, and indicators. As potential relationships 
between the categories were tested within the data, a substantive 
theory began to emerge. 

Additionally, each transcript was perused for critical events or 
“turning points” of sufficient magnitude to send a powerful 
message to participants at different points in the hemodialysis 
cycle. The data suggested that critical turning points could potentially
alter attitudes toward treatment. Because turning points surfaced
across all the thematic categories, each transcript was
subsequently perused to identify critical incidents that seemed 
integral to the category. Validity was assured by having two 
researchers construct independent interpretive summaries of each 
transcript and achieve consensus on the final version. An important 
focus was to capture the weight and importance of critical events 
for study participants. Participants were given an opportunity to 
read, or receive a verbal presentation on, their interpretive
summaries. All participants confirmed their interpretive summaries,
adding a further element of credibility to the findings. 

Following initial application of the constant-comparative 
method and construction of interpretive summaries, debriefing 
sessions were held regularly to clarify the themes and emerging 
conceptual categories. This additional time with the data was 
intensive, resulting in multiple revisions of the initial categories 
and their properties and thematic descriptors (eight drafts). 
Category saturation (i.e., no new data emerging) and the beginnings
of a substantive theory (i.e., theoretical constructs and tentative
linkages) were achieved following coding of 30 data sets.
Because the team wanted to ensure that all consenting patients
were given an opportunity to share their views, data collection proceeded
until interviews were completed with all 36 patients.

At the final step the focus shifted to enhancing the credibility and
accuracy of the classification system by subjecting it to examination
by independent consultants. The initial verification session led to the
collapsing of properties and categories into a more parsimonious set
(i.e., from 10 to 7 categories and 48 to 36 properties) and to the addition
of descriptor labels to differentiate meaningful divisions within properties.
The revised classification system was then applied to 20 data sets;
however, difficulties with overlapping categories continued to
impede the coding process. Further discussions between the consultants
and research team culminated in a further collapsing of categories
and properties (i.e., from 7 to 3 categories and 36 to 18
properties). All data sets were subsequently recoded with the revised
classification system and an interrater agreement of 95 % was achieved.

3.5 Findings

The findings suggested that the experiences of patients with ESRD
and on long-term hemodialysis could be captured with three categories
(meaning of illness and treatment, quality supports, and
adjustment to a new normal). Adjustment to a new normal emerged
as the core construct defining the social-psychological process.
The substantive theory, living with end-stage renal disease and
hemodialysis (LESRD-H), proposes that illness and treatment
experiences and social supports exert a direct impact on adjustment
to a new normal (see Fig. 1). It also conjectures that critical turning
points (i.e., meanings attributed to positive and negative critical
events that surface periodically to exert a singular and cumulative 
effect) link the constructs. Finally, all the constructs exert a direct 
impact on quality outcome with adjustment to a new normal also 
mediating the impact of illness and treatment experiences and social 
supports on outcome. 

All components of the substantive theory constantly change in
response to alterations in health status, perceived usefulness of supports,
and an evolving new normal. Adjustment to a new normal
encompasses how people view themselves and their roles in relation
to life on dialysis. Patients had to contend with ongoing
emotional, psychological, social, and spiritual struggles as they
attempted to maintain a semblance of normalcy. The meaning of
illness and treatment category captures the stress of dealing with
the concurrent effects of ESRD, comorbid conditions, and hemodialysis
treatment, as well as the ambiguity resulting from the tension
between knowing what is needed to maximize health versus
what is actually done about it. The supports category reflects
the perceived availability and usefulness of support from informal
and formal network members. It refers to the caring approaches
(i.e., technical, emotional, and psychological) used by significant
others (family, friends, health care providers, and fellow patients)
during dialysis and on nondialysis days. Quality outcome is an
evolving end point with subjective (i.e., satisfaction with life) and
objective (i.e., morbidity and mortality) components that are constantly
changing in response to illness and treatment events, social
supports, and adjustment. 

The thread linking the theory’s constructs was labeled critical 
turning points because of their import for shaping attitudes and 
behavior. Critical turning points ebbed and flowed in importance 
and impact in response to changing contextual factors (i.e., the 
physical environment: space, atmosphere; health status: perceived, 
actual; technical care: machine functioning, monitoring, response) 
and state of preparedness of the person for the event (i.e., aware of 
possibility for their unanticipated occurrence, knowing what to do, 
actual doing). As well, critical turning points in one area could 
shape subsequent perceptions of another area, and occurrences 
during a specific dialysis session, whether early or later in the treat¬ 
ment cycle, might or might not affect acceptance of this treatment 
type. Although isolated critical incidents early in the hemodialysis
cycle (i.e., acute illness episode precipitating renal failure, loss of
transplant or alternate treatment modality, loss of access site, and
loss of meaningful employment) left a strong impression, the
cumulative effects of a series of critical incidents over time seemed
to have far greater implications for how patients rated the quality
of supports and potential health outcomes.

It might be helpful to work through an example to clarify what
these separate but connected forces mean for a person living with
dialysis. A sudden drop in blood pressure while on hemodialysis was
identified by a number of respondents as a critical event. Although
a physiological event that may be anticipated or unanticipated,
depending on the track record of the person on dialysis, it is also a
psychological event appraised cognitively first and emotionally second.
Psychologically, one is driven to search for a causal factor. The
search may attribute responsibility to the mistakes of others (i.e.,
removal of too much fluid or its removal too quickly, malfunctioning
equipment: quality of care or social support), the way that the
person chooses to be in the world (i.e., not adhering to fluid and
diet restrictions), or changing physical health status (i.e., comorbid
illnesses such as coronary artery disease, diabetes: meaning of illness
and treatment). Emotionally, one identifies the feeling states of
fear, anxiety, and uncertainty (i.e., a terrifying experience comparable
to dying, fearing for one’s life, inability to control the event’s
inception but possibly its severity). The emotional reaction may
empower some individuals to assume actions that ideally reduce
the severity of the event (i.e., be attentive to feeling states that
constitute warning signs, alerting the nurse when detecting a
change in physical status, ensuring that blood pressure readings are
taken regularly). In contrast, other individuals may be so terrified
and anxious about the uncertainty of event occurrences that they
are in a constant state of tension while on dialysis (i.e., over-attentive
to feeling states, constantly seeking attention from the
nurses, too demanding): adjustment to a new normal.

3.6 Implications of Qualitative Findings

What do these critical turning points mean for dialysis patients?
First, it is a dawning of sorts, because the person must confront his 
or her own vulnerability. Second, it speaks to the “fragility” or limitations
of technical care. Third, it impresses upon the person the need
to be more vigilant about changing feeling states, healthy behavior, 
and the actions of health care providers. What is clear is that the 
person experiencing critical turning points is taken to a new level of 
awareness regarding the responsibilities of the self and others. 

The substantive theory’s constructs on meaning of illness and 
quality supports have been discussed extensively in the chronic illness 
literature and so augment theoretical work in this area. The chal¬ 
lenge facing health care providers is to identify care modalities 
capable of facilitating high-quality outcomes. As supported by 
study findings, appropriate care strategies are contingent on the 
social psychological processes defining patients’ experiences with 
illness and variant treatments.

4 An Example of Using a Qualitative Database for Scale Construction


4.1 Justification for Developing a Clinical Monitoring Tool


Irreversible kidney failure is a condition affecting increasing 
numbers of individuals, especially the elderly. Without treatment, the 
condition is fatal. Life can be prolonged with dialysis, but in accepting 
it, the person confronts many challenges. Prior research into patient 
experiences with ESRD and hemodialysis assessed physical and 
psychological stressors, methods of coping, quality of life, quality 
of supports, and satisfaction with care in a piecemeal fashion. 

This phase of the project was designed to develop reliable and 
valid scales for generating a descriptive database that would support 
the major premises of the model and, ultimately, provide health care 
workers with useful information at various points in the hemodialysis 
cycle. As noted in the preceding qualitative discussion, the findings 
supported the presence of a multidimensional construct. Therefore, 
items had to be generated that would capture how individuals inter¬ 
pret illness and treatment experiences, evaluate the quality of sup¬ 
port systems (formal and informal), and adjust to an evolving normal 
state. The importance of critical turning points suggested that the 
scales had to be capable of differentiating among individuals experi¬ 
encing and not experiencing problems within the identified thematic 
areas, as well as capturing responsiveness to change over time 
through natural evolution or planned intervention. This work 
culminated in the development of a testable version of the Patient 
Perception of Hemodialysis Scale (PPHS). 


4.2 Item Generation

A set of disease- and treatment-specific items was generated from
the qualitative database. This phase involved the following steps:

1. Identification of an initial set of items from the three major 
thematic categories constituting the substantive theory. 

2. Reduction of initial items and determination of the best rating 
scale format for this population. 

3. Validation (content and face) of generated items by experts 
and ESRD patients. 


4.2.1 Step 1. Identification of an Initial Set of Items from a Qualitative Database


Initially, coded transcripts were entered into a Paradox database file 
and transferred into the Statistical Package for the Social Sciences (SPSS) for
descriptive analysis. This step led to the creation of a descriptive 
profile of frequency and priority ratings of categories, properties, 
descriptors, and indicators by subject and group. Based on the prior¬ 
ity ratings within and across data sets, the research team, composed 
of members from different professional backgrounds, once again 
reviewed the coded transcripts to identify phrases to guide formula¬ 
tion of item stems. This step resulted in the generation of 164 stems. 
As item construction proceeded, the emphasis was placed on concise¬ 
ness and avoidance of negative wording, ambiguous terminology, 
jargon, value-laden words, and double-barreled questions.

4.2.2 Step 2. Reduction of the Initial Set of Items and Rating Scale Selection

The first draft of the items was reviewed and modified by the
researchers to increase clarity and diminish redundancy. This resulted
in the elimination of 46 items. The revised version was content validated
by two hemodialysis patients who had participated in the qualitative
study and expressed an interest in this phase of the research.
These individuals were also asked to comment on item clarity and
relevancy. A direct outcome of this step was a further reduction of
items from 118 to 98.

The research team then proceeded to develop a rating scale to
subject the items to a more rigorous pretest. The initial rating
scales focused on frequency of occurrence (never, rarely, sometimes,
often, or almost always), as well as importance of select
events and situations (not at all, a little bit, moderately, quite a bit,
considerably). Although it was recognized that a five-point scale
might not be sufficient for maximum reliability, the consensus was
that it would be difficult to devise unambiguous additional ordinal
adjectives.

4.2.3 Step 3. Content and Face Validation

Four content experts in the field were asked to review the
scale and confirmed the appropriateness of items for the identified
domains. An adult literacy expert was also consulted to ensure that
the scale was at the appropriate reading level for the target population.
Finally, three hemodialysis patients were asked to participate
in the rating of each item. One of the investigators was present to
identify difficulties in administration, ambiguities in wording, and
the time required to complete the task. It was determined that
patients experienced difficulty discerning between the frequency
and importance scales, adding considerably to the administration
time. Following patient input, scale items were reduced to 64,
ambiguous items reworded, and the rating scale modified to facilitate
ease of application.

The final version of the PPHS comprises 64 items, with 42 of
the items positively worded and 22 negatively worded. The rating
scale format that seemed to facilitate patients’ response was a five-point
Likert-type scale ranging from 0 (never, not at all) to 4
(almost always, extremely) with the lead-ins of how often, how
satisfied, how concerned, or how confident.

4.3 Pilot Study and Preliminary Psychometric Analysis

This phase of the research was primarily focused on generating 
data to conduct a preliminary psychometric analysis of the PPHS. 
The PPHS was pilot tested in a sample of patients receiving 
in-center hemodialysis at all provincial dialysis sites. One of the 
investigators trained an individual to administer the PPHS and 
the dialysis version of the Ferrans and Powers’ Quality of Life 
Instrument (QLI) [16]. Initially, interviews were conducted face 
to face with 112 patients during the first 2 h of hemodialysis 
treatment and ranged from 60 to 90 min.

The psychometric analysis of the data generated from the
administration of the PPHS and the QLI in the pilot study proceeded
in two steps: validity analysis of the PPHS, followed by reliability
analysis of the PPHS.

4.3.1 Step 1. Validity Analysis of the PPHS

Construct validity of the PPHS was assessed in a number of ways. 
First, exploratory factor analysis was used to generate a factor struc¬ 
ture. With this analysis, it was possible to determine if scale items that 
formed factor structures aligned in a logical fashion with the thematic 
categories comprising the substantive theory. Second, correlational 
analysis (i.e., Pearson’s r and Spearman’s ρ) was used to examine the
associations among the major subscales (i.e., constructs in the theoretical
model). Third, correlation matrices were generated to determine
the intercorrelations among the components of each subscale,
as well as the association of each component to its relevant sub¬ 
scale. The final step involved assessing the criterion-related validity 
(i.e., concurrent validity) of the PPHS with the QLI. 
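
As a rough illustration of the correlational step, the sketch below computes Pearson's r and Spearman's ρ between a subscale score and a total score. The scores are invented for illustration and are not pilot study data; the function names are hypothetical, and only the Python standard library is assumed.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical subscale and total scores for six respondents (not study data).
adjustment = [12, 18, 25, 9, 30, 21]
total = [60, 75, 110, 52, 128, 90]

print("Pearson r:", round(pearson_r(adjustment, total), 2))
print("Spearman rho:", round(spearman_rho(adjustment, total), 2))
```

Pearson's r assumes roughly linear, interval-level scores, whereas Spearman's ρ depends only on rank order, which is one reason both might be reported for Likert-type data.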

Exploratory factor analysis tentatively supported the construct 
validity of the PPHS (i.e., generated three major item clusters—ill¬ 
ness and treatment experiences, social supports, and adjustment to a 
new normal—that supported the theoretical constructs comprising 
the substantive theory induced from the qualitative database). The 
analysis confirmed the following: (a) Illness and treatment experiences
comprise four interrelated domains: physiological stressors,
knowledge, performance of activities of daily living, and self-health
management. (b) Social supports comprise two separate but interrelated
domains: formal (nurses, physicians, and allied health care
workers) and informal (i.e., family) networks. (c) Adjustment to a
new normal comprises two separate but interrelated domains:
psychological distress and emotional well-being.

Construct validity is also supported by the statistically significant,
strong, positive correlations observed between the major
subscales and the total scale: adjustment to a new normal (r = 0.91),
illness and treatment experiences (r = 0.77), and social supports
(r = 0.71). As hypothesized, the highest correlation is for the
adjustment to a new normal subscale. These findings indicate that
each major subscale measures separate but interconnected aspects
of the whole patient experience. The subscales also show low
to moderate, statistically significant correlations with each other
(r values range from 0.34 to 0.54).

Construct validity is further supported by statistically significant,
moderate to strong, positive correlations of minor subscales with the
relevant major subscale. The correlation ranges within each major subscale
are as follows: illness and treatment experience subscale (r values
range from 0.41 to 0.78), social supports subscale (r values range
from 0.43 to 0.82), and adjustment to a new normal subscale
(r values range from 0.80 to 0.90).

Concurrent validity of the PPHS with the QLI is only partially
supported with the pilot study data. Statistically significant, positive
relationships are observed among most major components of
the PPHS and the QLI, but many of these are in the low to moderate
range (r values from 0.23 to 0.62). Despite the logical alignment
between the major subscales of the PPHS and the QLI,
closer scrutiny reveals differences in a number of content areas.
This suggests that the QLI may not have been the most appropriate
scale for assessing the concurrent validity of the PPHS.

4.3.2 Step 2. Reliability Analysis of the PPHS

Once a preliminary list of factors was generated by exploratory
factor analysis and validated for theoretical consistency, the next
step was to assess the reliability of the scale structures. Cronbach’s
alpha (α) was used to assess subscale and total PPHS internal
consistency.

The total instrument has an alpha coefficient of α = 0.91. Alpha
coefficients for the three major subscales are as follows: adjustment
to a new normal (α = 0.88), social supports (α = 0.84), and illness
and treatment experiences (α = 0.71). The pilot study analysis
reveals that internal consistency is high for the total scale and
slightly lower, but within acceptable ranges, for each of the subscales.
Finally, alpha coefficients for the components of each major
subscale range from α = 0.43 to 0.89. The findings indicate that the
individual components of the illness and treatment experience,
supports, and adjustment subscales have a fair to very good internal
consistency. The low reliability scores for some components
(i.e., self-management, allied health, and family) could be attributed
to the small number of items.

5 Summary

Qualitative research methods facilitate greater understanding of 
human behavior and interactions within variant and similar con¬ 
texts. Grounded theory is a useful qualitative approach for generat¬ 
ing theory to capture basic psychosocial processes that can be 
formulated into a substantive theory. In turn, the major theoretical 
constructs of the theory can be used for scale development. Scales 
developed in this manner are more sensitive because they are firmly 
grounded in the patients’ own experiences. Useful monitoring 
tools can be developed for identifying significant changes in indi¬ 
viduals’ experiences with illness and treatment, social supports, and 
long-term adjustment, which have important implications for the 
quality of survival for patients with chronic disease in the short- 
and long term.

Chapter 19


Health Economics in Clinical Research 

Braden J. Manns 

Abstract 

The pressure for health care systems to provide more resource intensive health care and newer, more 
costly, therapies is significant, despite limited health care budgets. As such, demonstration that a new 
therapy is effective is no longer sufficient to ensure that it is funded within publicly funded health care 
systems. The impact of a therapy on health care costs is also an important consideration for decision-makers
who must allocate scarce resources. The clinical benefits and costs of a new therapy can be estimated
simultaneously using economic evaluation, the strengths and limitations of which are discussed
herein. In addition, this chapter includes discussion of the important economic outcomes that can be 
collected within a clinical trial (alongside the clinical outcome data) enabling consideration of the impact 
of the therapy on overall resource use, thus enabling performance of an economic evaluation, if the 
therapy is shown to be effective. 

Key words Economic evaluation, Cost-effectiveness, Costs, Health economics 


1 Overview 


The pressure for health care systems to provide more resource 
intensive health care and newer, more costly, therapies is sub¬ 
stantial, but health care budgets are limited. For example, 
nephrology programs caring for patients with ESRD are faced 
with numerous harsh realities: sicker patients requiring more 
medical interventions, newer and more expensive technology, 
and fixed budgets [1]. In 2011 in the USA, almost $50 billion 
was spent by all payers to care for patients with ESRD [2], and 
in economically developed countries, it has been estimated that 
nearly 3 % of overall health budgets are spent on ESRD care [1,
3-5]. This is despite the fact that less than a quarter of a percent
of the Canadian and US populations have ESRD [3, 5]. Given 
these cost pressures, demonstrating that a new therapy is effec¬ 
tive is no longer sufficient to ensure its uptake in practice within 
publicly funded health care. The impact of the therapy on health 
care costs must also be considered by decision-makers who then
decide how scarce resources should be allocated. If considered
during the trial planning stages, important economic outcomes 
can be collected alongside clinical outcome data in a clinical trial. 
This enables the performance of an economic evaluation which 
measures the impact of a therapy on clinical outcomes and over¬ 
all resource use. 

In this chapter, we discuss the basics of health economics, 
outlining important concepts that clinicians and other health professionals
should understand. Also highlighted are the role and use
of economic evaluations, including the various types, and a general
overview of how economic evaluations can be performed and 
interpreted. The strengths and weaknesses of economic evaluation 
are discussed along with how to avoid common pitfalls. Finally, 
information that could be collected alongside a clinical trial or that 
is often required from clinical research studies that would enable 
performance of a full economic evaluation is highlighted. 
Throughout the chapter, examples are highlighted to facilitate 
comprehension of the basic principles of economic evaluation. 


2 Health Economics: The Basics 


Whenever the term “economics” is mentioned in the context of 
caring for patients requiring dialysis, most clinicians likely assume 
that the intent is to limit expenditure. In reality, economics is about 
the relationship between resource inputs (the labor and capital 
used in treatment) and their benefits (improvements in survival 
and quality of life) [6]. The magnitude of such costs and benefits 
and the relationships between them are analyzed by use of 
“economic evaluation.” A full economic evaluation compares the 
costs and effectiveness of all the alternative treatment strategies for 
a given health problem. 

Most clinicians are comfortable interpreting and applying 
clinical studies that compare therapies and report the relative 
occurrence of a medical outcome of interest. In truth, if there were
no scarcity of resources in health care, then such evidence of effectiveness
would be all that would be required in medical decision-making.
As we move into the twenty-first century, there is no
denying that rationing of health care occurs, and that scarcity of 
resources is a reality in all publicly funded health care systems, 
including the USA. 


2.1 Basic Concepts: Opportunity Cost and Efficiency

2.1.1 Opportunity Cost 


The concept of opportunity cost, which is central to health 
economics, rests on two principles, scarcity of resources and choice 
[6]. As noted, even in societies with great wealth, there are not
enough resources to meet all of society’s desires, particularly in
the face of expensive technological advancement. This brings up
the concept of choice, where society, due to the presence of
scarcity, must make choices about what health programs to fund
and which ones to forgo. It is the benefits associated with a forgone
health care program or opportunity that constitute opportunity
costs. In the planning of health services, the aim of economic
evaluations has been to ensure that the benefits from health care 
programs implemented are greater than the “opportunity costs” of 
such programs. For example, in the absence of a budgetary increase, 
use of resources to establish a nocturnal hemodialysis program 
would result in fewer resources being available for a second program,
such as a multidisciplinary chronic kidney disease (CKD) clinic for 
predialysis patients. Allocation of resources to the nocturnal hemo¬ 
dialysis program would only be reasonable if the health gains per 
dollar spent exceeded those of the forgone opportunity, in this case 
a multidisciplinary CKD clinic. Thus, one way to help determine 
which is the better use of resources is to estimate the resources 
used (or costs) and health outcomes (or benefits) of each compet¬ 
ing program. 
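
The comparison described above can be sketched in a few lines. Every figure below is invented purely for illustration (the chapter reports no costs or health gains for these two programs), and the class and function names are hypothetical. The sketch assumes a fixed budget that can fund only one program and a single common measure of health gain.

```python
from dataclasses import dataclass


@dataclass
class Program:
    name: str
    annual_cost: float   # dollars per year (hypothetical)
    health_gain: float   # health gain per year in a common unit (hypothetical)

    def gain_per_dollar(self) -> float:
        return self.health_gain / self.annual_cost


def better_use_of_funds(a: Program, b: Program) -> Program:
    """Return the program with the larger health gain per dollar spent."""
    return a if a.gain_per_dollar() >= b.gain_per_dollar() else b


# Illustrative numbers only; they are not taken from any study.
nocturnal_hd = Program("Nocturnal hemodialysis program", 1_000_000, 20.0)
ckd_clinic = Program("Multidisciplinary CKD clinic", 1_000_000, 25.0)

funded = better_use_of_funds(nocturnal_hd, ckd_clinic)
forgone = ckd_clinic if funded is nocturnal_hd else nocturnal_hd

# The opportunity cost of the funded program is the benefit of the forgone one.
print("Fund:", funded.name)
print("Opportunity cost (forgone health gain):", forgone.health_gain)
```

With these made-up inputs the CKD clinic yields more health per dollar, so funding the nocturnal program instead would carry an opportunity cost larger than its own benefit.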

2.1.2 What Is Meant by Efficiency?

Efficiency is about the relationship between inputs (i.e., resources)
and outcomes (i.e., improvements in health). In health economics
there are two types of efficiency, allocative and technical [6].
Allocative efficiency deals with the question of whether to allocate 
resources to different groups of patients with different health 
problems (i.e., should a health region direct resources towards a 
rehabilitation clinic for patients with chronic obstructive pulmo¬ 
nary disease, or to a nocturnal hemodialysis program for end-stage 
renal disease (ESRD) patients). This is in contrast to technical 
efficiency where the resources are being distributed among the 
same patient population, in an attempt to maximize health gains 
for that patient group within the budget available (i.e., for patients 
with acute renal failure, should an intensive care unit (ICU) pro¬ 
vide intermittent hemodialysis or continuous renal replacement 
therapy) [7]. Choosing the proper type of economic evaluation 
depends on understanding what type of efficiency question is to 
be addressed. 


3 The Different Types of Economic Evaluations 

There are three main types of economic evaluation that follow 
from the concepts of opportunity cost and efficiency. Which one is 
used will depend on the question being addressed. If the question 
is one of “technical efficiency” (i.e., with a fixed amount of resources
available, what is the most efficient way of treating people with acute
renal failure in the ICU: intermittent hemodialysis or continuous
renal replacement therapy? [7]), then a cost-effectiveness or cost-utility
study is most appropriate. If the question is one of “allocative
efficiency” (i.e., an ESRD program hoping to develop a nocturnal
hemodialysis program competes for funding with a respiratory
program wishing to establish a rehabilitation program for patients
with chronic obstructive pulmonary disease), then a cost-utility or
possibly cost-benefit study is more appropriate. When addressing
technical efficiency, the same group of patients will be treated, but
the question becomes one of “how”. Allocative efficiency questions
inevitably involve comparisons of different groups of patients
(i.e., how many resources to allocate to each). All economic evaluations
compare a specific technology (or policy) to another,
whether it is the status quo or a comparable technology.

3.1 Cost-Effectiveness Analysis

Cost-effectiveness analyses assess health benefits in naturally occurring
units such as life years gained, cases prevented, or units of
blood pressure reduction achieved using an intervention and can
be used to evaluate technical efficiency. Early studies examining the
cost-effectiveness of statins (cholesterol-lowering drugs) assessed
the cost to prevent a heart attack [8], among other outcomes.
Studies found that the cost to prevent a heart attack averaged
$5,000, while the cost per life year saved was $27,000. This highlights
a challenge associated with the use of cost-effectiveness analysis;
that is, there is no consensus as to what constitutes good value
for money in the prevention of heart attacks. Moreover, comparing
the results of this study with other cost-effectiveness studies
that report health benefits using a different metric, or for treatments
for patients with other health conditions, is not possible.

A subset of cost-effectiveness analysis is cost-minimization
analysis, which can be used when clinical outcomes between treatment
strategies in the same patient population are known to be
equivalent. In cost-minimization analysis, the cost of the treatments
being compared, as well as the associated costs (or savings)
in other areas of health care (such as outpatient visits), are estimated,
but clinical outcomes are excluded as they do not differ.
For example, if different dialysis modalities were known to be
equivalent in survival and quality of life, then a cost-minimization
analysis could be used to compare the alternate modalities.
Canadian micro-costing studies have noted that the total health care
costs of treating dialysis patients with in-center hemodialysis (hospital
and satellite), home hemodialysis, and peritoneal dialysis are
approximately $95,000 to $107,000, $71,000 to $90,000, and
$56,000 per year, respectively ($2013 CAD) [9-11]. Assuming
that the clinical outcomes associated with the different therapies
are equal [12-15], then using cost-minimization analysis, one
could conclude that treating eligible patients with home-based
dialysis could save significant resources without impacting
outcomes.

3.2 Cost-Utility Analysis

Comparisons often need to be made between therapies for which 
clinical success may be measured in very different units. When such 
an “allocative decision” needs to be made, it is important to express
health benefits in an equivalent fashion. This can be done using
either a cost-benefit analysis or cost-utility analysis. In a cost-benefit 
analysis, medical outcomes are valued in dollars, often using a 
method called “willingness-to-pay” [6]. Cost-benefit studies are 
most often used as a research tool and interested readers are 
referred elsewhere for more detail [6]. 

Cost-utility analysis, on the other hand, is commonly used by 
decision-makers to address issues of technical or allocative efficiency.
This type of analysis is perhaps the most familiar to clinicians
since it is the basis of “QALY league tables” [16, 17]. Clinical
outcomes are usually considered in terms of healthy years. Healthy 
years can be measured using a “utility-based” index, which incor¬ 
porates effects on both quantity and quality of life; the most widely 
used scale is the “quality-adjusted life year” (QALY). The QALY is 
determined as the product of the number of years of life gained (or 
remaining) multiplied by the “utility” or “quality” of those years 
(rated from zero—a state equivalent to death—to one—equivalent 
to a state of full health) [18]. In practice, the utility score can be 
determined using either direct or indirect measures. Direct mea¬ 
sures include the standard gamble or the time trade-off, which can 
be assessed in the relevant patient group. Although these measures 
are felt to be the most theoretically valid measures of utility, they 
are difficult to include in a study, often requiring the help of an 
administrator [18]. Alternatively, utility scores can be derived from 
indirect measures, like the EuroQol EQ-5D or the Health Utilities
Index, which are both questionnaire-based measures that are easy 
and quick for patients to complete. As such, these measures can 
easily be incorporated into a clinical trial, along with other relevant 
measures of health-related quality of life (HRQOL), enabling esti¬ 
mation of the expected QALYs associated with a treatment strat¬ 
egy. Details on the different methods that can be used to elicit such 
utilities are available elsewhere [18]. 
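
The QALY arithmetic described above (years of life multiplied by the utility of those years) can be sketched as follows. The health-state profile is hypothetical, the function name is an assumption made for illustration, and no discounting of future years is applied, which a full evaluation would normally consider.

```python
def qalys(health_states):
    """Sum of (years x utility) over a sequence of health states.

    Each entry is a (years, utility) pair, with utility between
    0 (equivalent to death) and 1 (equivalent to full health).
    """
    total = 0.0
    for years, utility in health_states:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utility must lie between 0 and 1")
        total += years * utility
    return total


# Hypothetical profile: 4 years at utility 0.75, then 2 years at utility 0.5.
profile = [(4, 0.75), (2, 0.5)]
print("QALYs:", qalys(profile))
```

In this made-up profile, six years of survival correspond to fewer than six QALYs because the years are lived at less than full health.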

An example of a cost-utility analysis is that by Cameron et al., 
where the authors sought to clarify the costs and benefits of blood 
glucose self-monitoring in patients with type 2 diabetes who do 
not use insulin [19]. A meta-analysis of randomized controlled trials
comparing self-monitoring with no self-monitoring showed
that HbA1c was 0.25 % (95 % CI 0.15-0.36 %) lower in patients
who were randomized to blood glucose self-monitoring [20]. 
Though the difference was statistically significant, it was uncertain 
whether this would translate into clinically significant health ben¬ 
efits. Moreover, the cost associated with blood glucose test strips 
was estimated to be over $370 million in Canada in 2006, more 
than 50 % of which was for patients not taking insulin [19]. An 
economic evaluation was performed to better inform decisions 
regarding prescribing and reimbursement of blood glucose test 
strips [19]. The authors began with an estimate of the effect of 
self-monitoring on HbA1c and, based on this effect, they used
modeling techniques to derive the difference in both diabetes-related
end points and life expectancy that would be expected given the
achieved difference in HbA1c. The life-years of hypothetical
patients were then weighted by the utility associated with the various
health states modeled (all based on the occurrence of diabetes-related
complications). The primary outcomes of the analysis
were QALYs and the cost per QALY gained. They found that, compared
with no self-monitoring, self-monitoring was associated with
a cost per QALY gained of CDN $113,643. In the patient subgroup
receiving only lifestyle interventions, the cost-utility ratio
was less favorable at $292,144 per QALY. This introduces the next 
topic. What does $113,643 per QALY mean? Does that represent 
good value for money? 
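
The arithmetic behind a cost per QALY gained can be sketched in a few lines. The example below is only an illustration: the `Strategy` class, the function name, and every number in it are hypothetical and are not taken from the self-monitoring analysis. It weights each period of life by a utility between 0 and 1, sums the result into QALYs, and divides the difference in cost by the difference in QALYs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Strategy:
    """A treatment strategy summarised for a cost-utility comparison."""
    name: str
    cost: float                             # expected cost per patient
    state_years: List[Tuple[float, float]]  # (years lived, utility weight 0-1)

    def qalys(self) -> float:
        # Each period of life is weighted by the utility of its health state.
        return sum(years * utility for years, utility in self.state_years)


def cost_per_qaly_gained(new: Strategy, comparator: Strategy) -> float:
    """Difference in cost divided by difference in QALYs."""
    extra_cost = new.cost - comparator.cost
    extra_qalys = new.qalys() - comparator.qalys()
    if extra_qalys == 0:
        raise ValueError("no difference in QALYs; the ratio is undefined")
    return extra_cost / extra_qalys


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
usual_care = Strategy("no self-monitoring", 10_000.0, [(10.0, 0.8), (4.0, 0.5)])
monitoring = Strategy("self-monitoring", 12_000.0, [(10.0, 0.8), (5.0, 0.5)])

print(round(usual_care.qalys(), 2), round(monitoring.qalys(), 2))
print(round(cost_per_qaly_gained(monitoring, usual_care)))
```

With these made-up inputs the new strategy adds half a QALY for an extra 2,000 in cost. A negative ratio would need separate interpretation, since it can arise either from a cheaper and better strategy or from a costlier and worse one.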


4 How to Interpret the Results of Economic Evaluations 

Of course, before evaluating the results of any study, whether it is 
an economic evaluation, a clinical trial or observational research, it 
is important to determine whether the study is valid. While this is 
a key step, the details of how to determine if an economic evalua¬ 
tion is valid or not are beyond the scope of this chapter. As with 
other areas of clinical research, good practice in economic evalua¬ 
tion is guided by several published guidelines [21-24] and adher¬ 
ence to these guidelines increases the likelihood of a published 
evaluation being valid. 

When interpreting the results of an economic evaluation com¬ 
paring a new intervention with standard care, it is often helpful to 
first compare the treatments on the basis of clinical outcomes and 
costs separately (Fig. 1) [25]. As noted in Fig. 1, there are some 
situations where an intervention should clearly be introduced (i.e., 
less expensive and superior clinical outcomes (cell A1)) and others 
where the therapy should clearly not be used (i.e., more expensive 
and less effective (cell C3)). The use of economic evaluation is 
considered most important in cell C1, where a new therapy is 
judged more effective, but is also more expensive than a compara¬ 
tor. In these situations, judgment is required as to whether the 
extra resources represent “good value for money.” Returning to 
the notion of opportunity cost, in these situations, one must con¬ 
sider whether the required expenditure could be used to support 
other unfunded treatments that would improve health to a greater 
extent, either for the same group of patients (technical efficiency) 
or another group of patients (allocative efficiency). 
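
The logic of the matrix in Fig. 1 can be written as a small lookup. Only three cells are described in the text (A1 should be adopted, C3 should not be used, C1 requires judgment); the handling of the remaining six cells below follows ordinary dominance reasoning and is an assumption, not a transcription of the figure. The function and label names are hypothetical.

```python
# Higher rank is better on each axis.
EFFECT_RANK = {"greater": 1, "same": 0, "less": -1}   # columns 1, 2, 3
COST_RANK = {"savings": 1, "same": 0, "greater": -1}  # rows A, B, C


def assess(effect: str, cost: str) -> str:
    """Classify a new therapy relative to current care."""
    e = EFFECT_RANK[effect]
    c = COST_RANK[cost]
    # At least as good on both axes and strictly better on one: adopt.
    if e >= 0 and c >= 0 and e + c > 0:
        return "adopt"
    # At most as good on both axes and strictly worse on one: reject.
    if e <= 0 and c <= 0 and e + c < 0:
        return "reject"
    # Trade-offs (e.g., cell C1) and the no-difference cell fall through here.
    return "judgment required"


print(assess("greater", "savings"))   # cell A1
print(assess("less", "greater"))      # cell C3
print(assess("greater", "greater"))   # cell C1
```

The third call is the case where an economic evaluation is most informative: the therapy is better but costs more, so the answer depends on what else the money could buy.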

Relative to current care, should a new treatment be adopted, given evidence of: 

Effectiveness (declining from 1 to 3). Compared with the control treatment the experimental treatment has: 

1. Evidence of greater effectiveness 
2. Evidence of no difference in effectiveness 
3. Evidence of less effectiveness 

Cost. Compared with the control treatment the experimental treatment has: 

A. Evidence of cost savings 
B. Evidence of no difference in costs 
C. Evidence of greater costs 

Legend: marked cells = Judgment required 

Fig. 1 Assessing the cost-effectiveness of a new therapy relative to current care (with permission from 
C. Donaldson [25]) 


In practice, it is often difficult to perform such comparisons 
between “competing interventions” since decision-makers often 
consider funding for new interventions at discrete (and different) 
points in time. In these situations, decision-makers may compare 
the cost per QALY for the intervention being considered with the 
cost per QALY of therapies that have been previously rejected or 
accepted for funding. Using such a strategy, it has been observed 
that most therapies with a cost per QALY below $20,000 are 
funded within Canadian Medicare [26], and that most therapies 
with a cost per QALY above $100,000 are not funded within 
Canadian publicly funded health care. For therapies with a cost per 
QALY between $20,000 and $100,000, funding is not consistently 
provided and a range of other factors may be considered in the 
decision to fund such interventions [26, 27]. There is a growing 
literature attempting to identify what other characteristics should 
be considered. These include (1) whether QALY gains are achieved 
through an extension of life (particularly for therapies that are 
immediately life saving) or through improvements in quality of life, 
(2) the number of people eligible for treatment, (3) the age of the 
potentially treatable patients (younger versus older), (4) whether 
the treatment was for people with good or poor underlying base¬ 
line health, (5) the likelihood of the treatment being successful, 
and (6) its impact on equality of access to therapy (equity) [27, 
28]. Moreover, the cost-effectiveness thresholds appear to vary 
across countries and committees tasked with using cost-effective¬ 
ness information to make funding decisions [29]. 
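
As a rough illustration, the observed Canadian funding pattern can be expressed as three bands. The sketch below uses only the $20,000 and $100,000 figures quoted above; the function name and the band labels are hypothetical, and the bands describe past funding behaviour rather than a formal decision rule.

```python
def funding_pattern(cost_per_qaly: float,
                    lower: float = 20_000.0,
                    upper: float = 100_000.0) -> str:
    """Map a cost per QALY onto the funding pattern described in the text."""
    if cost_per_qaly < lower:
        return "usually funded"
    if cost_per_qaly > upper:
        return "usually not funded"
    return "inconsistently funded; other factors are considered"


# The two self-monitoring estimates reported earlier in the chapter.
for estimate in (113_643, 292_144):
    print(estimate, "->", funding_pattern(estimate))
```

Both self-monitoring estimates fall above the upper band. Because thresholds vary across countries and committees, the defaults should be treated as adjustable parameters.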


5 Strengths and Weaknesses of Economic Evaluations 

In the current era of health care with constrained resources, ever- 
increasing demands, and many new and expensive therapies, deci¬ 
sions to provide new therapies within publicly funded health care 
systems can no longer be based on effectiveness data alone. The 
cost of the therapy must also be considered, and the use of eco¬ 
nomic evaluations can help determine whether new therapies are 
necessary and which ones to provide. 

Despite their potential, and their growing use, a number of 
issues with economic evaluations have been raised. In part this is 
due to their use by pharmaceutical manufacturers who perform 
such analyses to support a positive listing recommendation for 
their drug. Given that the results of such evaluations can be influenced 
by the inputs, parameters that favor the company’s product are 
commonly chosen, resulting in overly 
optimistic incremental cost-effectiveness ratios. Empirical research 
in support of this contention is available. For instance, a study 
comparing manufacturer-sponsored and independently commis¬ 
sioned analyses for the National Institute for Clinical Excellence in 
the UK found that the cost per QALY was significantly more 
attractive in nearly 50 % of manufacturer-sponsored submissions 
[30]. Requiring manufacturers to adhere to published guidelines 
on the conduct of economic evaluations may minimize such bias 
[21, 23, 24]. 

The validity of an economic evaluation is also limited by the quality 
of the available data, including how accurately the impact of the 
therapy on clinical outcomes, quality of life, and costs has been 
assessed. Data are commonly inadequate when clinical trials 
of a new intervention are performed without regard to the eventual 
requirement for conducting an economic evaluation. 

First, the impact of a new therapy on clinical end points may be 
unknown. This may occur in situations when randomized trials 
have not been conducted or are ethically impossible to perform. In 
the absence of RCTs, and hence not knowing whether a new ther¬ 
apy is effective, the performance of an economic evaluation is 
unlikely to inform decision-making since it will be unable to deter¬ 
mine whether a therapy represents good value for money. The 
impact of a new therapy on clinical end points is also unknown 
when randomized trials have been done but the outcomes studied 
have included only putative surrogate outcomes. For instance, in 
the assessment of blood glucose self-monitoring in patients with 
type 2 diabetes who do not use insulin, only the impact of this testing 
on HbA1c (a putative surrogate end point for clinical outcomes) 
was studied. The authors began with an estimate of the 
effect of self-monitoring on HbA1c and, based on this effect, they 
used modeling techniques to derive the difference in both diabetes-related 
end points (i.e., heart attacks, strokes, and kidney failure) 
and life expectancy that would be expected given the achieved 
difference in HbA1c. Some antidiabetic agents that lower HbA1c 
have been shown in clinical trials to be associated with improved 
clinical outcomes, while other agents that lower HbA1c have been 
shown to increase the risk of cardiovascular outcomes. Given this, 
the cost-effectiveness of blood glucose self-monitoring is sensitive 
to an assumption of clinical benefit (on heart attacks and strokes) 
associated with lowering HbA1c. 

Second, an accurate assessment of the impact of the therapy on 
quality of life may not be available, as quality of life measures may 
not have been included in the randomized trials. Lastly, the impact 
of the therapy on other health care costs may not have been mea¬ 
sured or was not measured adequately. In these situations, the 
impact of the therapy on costing outcomes must be estimated, and 
the uncertainty in these variables leads to uncertainty in the overall 
cost-effectiveness of the new therapy. When planning an RCT, 
these three limitations can all potentially be addressed by measur¬ 
ing clinical end points, the impact of the therapy on overall quality 
of life and health care costs. Although some of these methodologi¬ 
cal challenges may be present in many analyses, it is important to 
minimize their impact through careful planning and conduct of 
the analysis, as described below. 


6 How to Conduct an Economic Evaluation 

Economic evaluations can either be done alongside a clinical trial 
or by using decision analysis [31]. Perhaps the best example of an 
economic evaluation done alongside a clinical trial was that of lung 
volume reduction surgery for patients with severe emphysema [32]. 
In this study, investigators randomized patients with emphysema 
to lung volume reduction surgery or medical care and followed 
them for 3 years, examining survival, quality of life, and health care 
costs. Investigators found that in certain predefined subgroups, 
surgery improved survival and quality of life, but increased costs 
significantly. By combining the impact of the therapy on survival 
and quality of life into QALYs, the authors were able to estimate a 
cost per QALY of US $190,000 over a 3-year time horizon 
(Table 1) [32]. 

Given the minimization of bias that occurs during the random¬ 
ization process, data collected alongside an RCT is likely to be of 
high quality, and valid. Moreover, given the timely nature of RCTs 
that are required for medication licensing, conducting an eco¬ 
nomic evaluation alongside a clinical trial also ensures that the 
results are available in a timely fashion. Moreover, if the authors of 
the RCT have collected resource use data, then an additional 
advantage is that information on cost and outcomes is available 
from the same patients, increasing the transparency and accuracy of 
the analysis. 

There are several limitations, however, including that RCTs 
are often performed under very strict conditions and the results 
may not be representative of those that would be seen under usual 
clinical conditions [31]. Moreover, RCTs may not use the appro¬ 
priate comparator, or may use surrogate (i.e., non-clinical) rather 
than clinical end points, which, as noted above, makes performance 
of a valid economic evaluation challenging. Moreover, one of the 
main problems associated with doing an economic evaluation 
alongside a clinical trial exclusively is that clinical trials typically 
have a short time horizon, rather than a longer time horizon as 
recommended by practice guidelines [31]. With specific reference 
to the example discussed above, an analysis restricted to a 3-year time 
horizon implicitly assumes that the additional survivors (those who survived because 
they had lung surgery rather than medical care) died at 3 
years, when in reality these survivors would be likely to survive for 
several more years. 

Table 1 

Total health care costs, and QALYs gained for patients with emphysema randomized to lung-volume reduction surgery, 

In practice, most economic evaluations are performed using 
decision analysis. Decision analysis is a systematic approach to decision-making 
under conditions of uncertainty, where information is 
combined from a variety of sources, including clinical trials, observational 
studies and costing studies. Using data on health outcomes and costs 
from an observational cohort of patients, a mathematical model is 
usually constructed to simulate what happens to patients with the 
condition of interest who are treated using the new treatment or 
standard of care. The ability of a new treatment to avert adverse clinical 
outcomes can then be overlaid on this model using the results from 
the relevant RCTs, and the long-term impact of this therapy on outcomes 
and costs, compared with standard care, can be determined. In 
the economic evaluation of lung-volume reduction surgery, investigators 
subsequently used decision analysis to model the impact of lung 
surgery on longer term clinical outcomes and costs at 10 years and 
determined that the cost per QALY was $53,000 (i.e., significantly 
more attractive than noted using a 3-year time horizon). 
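
Why a longer time horizon can make a therapy with a large upfront cost look more attractive can be seen in a very small cohort model. The sketch below is a deliberately simplified illustration, not the model used in the lung-volume reduction study: the survival probabilities, utilities, and costs are all hypothetical, and it applies neither discounting nor a half-cycle correction.

```python
def cohort_totals(annual_survival: float, utility: float,
                  upfront_cost: float, annual_cost: float,
                  horizon_years: int) -> tuple:
    """Expected cost and QALYs per patient over a fixed time horizon."""
    alive = 1.0            # proportion of the cohort alive at the start of the year
    cost = upfront_cost
    qalys = 0.0
    for _ in range(horizon_years):
        cost += alive * annual_cost
        qalys += alive * utility
        alive *= annual_survival
    return cost, qalys


def cost_per_qaly(horizon_years: int) -> float:
    # Hypothetical inputs: surgery has a large one-time cost, slightly better
    # survival, and better quality of life than medical care.
    surgery = cohort_totals(0.90, 0.70, 30_000.0, 5_000.0, horizon_years)
    medical = cohort_totals(0.85, 0.60, 0.0, 5_000.0, horizon_years)
    return (surgery[0] - medical[0]) / (surgery[1] - medical[1])


for horizon in (3, 10):
    print(horizon, "years:", round(cost_per_qaly(horizon)))
```

With these inputs the upfront cost is spread over few QALYs at 3 years and over many more at 10 years, so the ratio falls as the horizon lengthens. The direction of the change depends on the assumption that the benefit persists beyond the trial.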


7 What to Consider When Planning Your Clinical Trial? 

Since providing evidence of effectiveness in a well-conducted clinical 
trial no longer ensures the uptake of a new therapy, it is important 
to build an economic argument for the new therapy in question as 
well, enabling decision-makers to make a more informed decision 
regarding funding. The information required can be collected 
alongside randomized trials in many situations, but this must be 
planned in advance. Conducting an economic sub-study within a 
clinical trial is acknowledged as important by many agencies who 
fund large clinical trials—in fact, acquiring funding for such studies 
is more likely if an economic sub-study is proposed. 

So what is important to consider in the planning of a clinical 
trial? Firstly, determining the impact of the therapy on clinical end 
points is crucial since if only putative surrogate end points are mea¬ 
sured, the impact of the therapy on clinical end points will remain 
unknown. Where appropriate, the impact of the therapy on quality 
of life should be measured using disease-specific and generic 
quality of life measures [6, 18]. As noted above, one of the 
generic quality of life measures should include a utility measure, 
which can be directly incorporated into an economic evaluation. 

Assessing the impact of the therapy on costs is also impor¬ 
tant. Sometimes, when the study is limited in duration, or sample 
size, an accurate assessment of the impact of the therapy on costs 
is not possible. Where possible though, it is important to not 
only accurately measure the cost of the new intervention (par¬ 
ticularly when the intervention is costly, complicated or requires 
hospital admission), but to also measure the impact of the ther¬ 
apy on the occurrence of costly adverse health outcomes that may 
differ between the treatment strategies. For instance, if the ther¬ 
apy is designed to prevent an adverse health outcome, such as 
arteriovenous fistula failure, or dialysis line sepsis, both of which 
may require hospital admission and/or surgery, measuring the 
cost of this complication is important [33]. The types of costs 
that should be included in an economic evaluation (and should 
therefore be measured in an RCT) have been reviewed and are 
discussed in recently released guidelines for conduct of economic 
evaluations [21]. While there are different methods of measuring 
costs, this discussion is outside the scope of this chapter; interested 
readers are referred elsewhere [18]. 

Finally, there may be certain variables that can be measured 
accurately within a clinical trial that will impact the economic 
attractiveness of a new therapy and that need to be considered 
within the context of the disease and the treatment. For instance, 
the frequency or severity of adverse events may have a significant 
impact on the attractiveness of the therapy, and in certain situa¬ 
tions, adherence with the new therapy may be important. Measuring 
these alongside the trial will enhance the validity of the accompa¬ 
nying economic evaluation. 


8 How Can Health Care Professionals Use the Results of Economic Evaluations 
When Caring for Their Patients? 

Confronted with fiscal and demographic challenges, publicly 
funded health care systems face increasing pressure to constrain 
resource use without impacting health care services. Health care 
managers continually decide how to allocate scarce health care 
resources, including determining what tests and treatments will 
be made available. However, prioritizing health care resources is 
not the sole responsibility of managers and decision-makers, and 
to be successful in maximizing population health, active physician 
engagement is required. With less than one half of one percent of 
the nation’s population determining how over 10 % of the GDP 
is spent, clinical decisions are “purchasing decisions” and should 
be made within the context of competing uses of finite resources 
to ensure that the most effective interventions are available for 
the patients who are most likely to benefit. While most clinicians 
have been trained to consider only the needs of the patient in 
front of them when making recommendations and providing 
care, given fiscal realities, this position may not be sustainable in 
the long term. 

Physicians make value-based decisions about other finite 
resources such as their time and use of beds in a hospital or inten¬ 
sive care unit [34] —and need to apply this same skill set to other 
health care resources. Although health care decision-makers may 
limit access to tests or therapies that do not provide reasonable 
value for money, in many situations, the physician is the gatekeeper 
to tests or treatments, which is appropriate given that they use 
complementary information to make informed decisions about 
their patients. This section is meant to help physicians incorporate 
the notion of value for money into routine clinical care. 

Faced with a patient with a health issue, a physician considers 
the work up and/or treatment needed. It is important, however, 
for the physician to first consider whether the intervention (or test) 
is truly effective. Is there evidence that it improves clinical outcomes 
(or only that it improves nonclinical outcomes), and was the 
comparison against standard of care or placebo? If it was placebo, 
do you expect that the new intervention will have a significant benefit 
in comparison to what is already standard of care? These questions are 
particularly important if the effectiveness of the intervention is marginal at 
best, or has only been shown to impact nonclinical end points [35] 
(in other words, when significant clinical uncertainty exists). 
In these circumstances, physicians should not feel compelled to 
routinely offer such therapies or tests, particularly when they are 
notably more expensive than current therapy. 

Assuming the test or therapy has been shown to improve clini¬ 
cal outcomes with respect to the standard of care, it is important 
to next consider the resource implications of the new treatment. 
Are there other less expensive treatments that could be tried first? 
In many situations, there are less expensive, or generic medica¬ 
tions that many patients will respond to. If other therapies have 
been tried, consideration as to whether the medication or treat¬ 
ment provides value for money is important, and physicians need 
to begin to integrate and routinely consider cost and “value for 
money” in clinical practice. While it may be difficult to operation¬ 
alize the use of economic evaluations during every patient encoun¬ 
ter, if costs were integrated within clinical practice guidelines, 
incorporating the consideration of cost into usual physician prac¬ 
tice would become easier. In situations where physicians deal with 
this issue on a daily basis, and where cost has not been considered 
in clinical practice guidelines, then groups of physicians could 
consider developing local standards of care with which their prac¬ 
tice is consistent. 

The uncomfortable truth is that resources are limited and 
choices must be made. While physicians have an obligation to their 
patient, they must also consider their obligation to society and to 
their other patients. If the goal of a health care system is to maxi¬ 
mize the health of all of its population under the constraint of a 
fixed budget, then considering cost-effectiveness is a reasonable 
tool to help make these choices. While physicians may be reluctant 
to incorporate the consideration of cost in their daily care, their 
role in allocating scarce resources cannot be avoided. Indeed, since 
one of the roles of publicly funded health care is to maximize 
health gains within a restricted budget, considering “value for 
money” and limiting expensive therapies to those who can benefit 
most seems a reasonable and equitable approach. Modifying and 
adhering to physician-developed clinical practice guidelines that 
take cost into account could help ease the tension between a physi¬ 
cian’s clinical decision-making and health system objectives. 


9 Conclusion 


Given financial constraints and cost pressures that exist within 
publicly funded health care systems, it is not surprising that the 
demonstration of effectiveness of a new therapy is no longer 
enough to ensure that it can be used in practice within publicly 
funded health care systems. The impact of the new therapy on 
costs is important and is considered by decision-makers who decide 
how and whether scarce resources should be invested. When plan¬ 
ning a clinical trial, important economic outcomes can be collected 
alongside the clinical data, permitting the evaluation of the impact 
of the therapy on overall resource use, and thus enabling the performance 
of an economic evaluation, if appropriate. 


Susan Stuckless and Patrick S. Parfrey 

Abstract 

Clinical epidemiological research in genetic diseases entails assessment of phenotypes, the burden and 
etiology of disease, and the efficacy of preventive measures or treatments in populations. In all areas, the 
main focus is to describe the relationship between exposure and outcome and to determine one of the fol¬ 
lowing: prevalence, incidence, cause, prognosis, or effect of treatment. The accuracy of these conclusions 
is determined by the validity of the study. Validity is determined by addressing potential biases and possible 
confounders that may be responsible for the observed association. Therefore, it is important to understand 
the types of bias that exist and also to be able to assess their impact on the magnitude and direction of the 
observed effect. The following chapter reviews the epidemiological concepts of selection bias, information 
bias, and confounding and discusses ways in which these sources of bias can be minimized. 

Key words Genetic diseases, Epidemiology, Selection bias, Information bias, Confounding, Validity 


1 Introduction 


The scope of clinical epidemiology is broad, ranging from the 
study of the patterns and predictors of health outcomes in defined 
populations to the assessment of diagnostic and management 
options in the care of individual patients. Moreover, the discipline 
encompasses such diverse topics as the evaluation of treatment 
effectiveness, causality, assessment of screening and diagnostic 
tests, and clinical decision analysis [1]. No matter what topic you 
are addressing, there are two basic components to any epidemio¬ 
logical study: exposure and outcome. The exposure can be a risk 
factor, a prognostic factor, a diagnostic test, or a treatment, and the 
outcome is usually death or disease [2]. In inherited diseases, 
mutated genes are the risk factors which predispose to autosomal 
dominant, autosomal recessive, X-linked and complex disorders. 
Clinical epidemiology methods are used to describe associations 
between exposures and outcomes. 

The best research design for the investigation of causal relationships 
is the randomized clinical trial. However, it is not always 
feasible or ethical to perform such a study and, under these circumstances, 
observational studies may be the best alternatives. 
Observational studies are hypothesis-testing analytic studies that 
do not require manipulation of an exposure [3]. Participants are 
simply observed over time and their exposures and outcomes are 
measured and recorded. Three main study designs are used in 
observational studies: cohort, case-control, and cross-sectional. 
While these studies cannot prove causality, they can provide strong 
evidence for and show the strength of an association between a 
disease and putative causative factors [4]. Consequently, these 
research designs are frequently used to determine the phenotype 
associated with particular genotypes [5-7] and to assess the impact 
of interventions on the outcomes of inherited diseases [8]. The 
limitations imposed by these research designs are often com¬ 
pounded by lack of power due to the reality of small sample sizes 
for some disorders. 

Epidemiologic studies have inherent limitations that preclude 
establishing causal relationships [4]. While an appropriate research 
question, or hypothesis, is the foundation of a scientific study, 
proper methodology and study design are essential to interpret 
and, ultimately, to have clinical relevance [9]. Assessing the quality 
of epidemiological studies equates to assessing their validity [10]. 
To assess whether an observed association is likely to be a true 
cause-effect relationship, you need to consider three threats to 
validity: bias, confounding, and chance [10, 11]. Bias occurs when 
there is a deviation of the results of a study from the truth and can 
be defined as “any systematic error in the design, conduct or analy¬ 
sis of a study that results in a mistaken estimate of an exposure’s 
effect on the risk of disease” [4]. Confounding is considered a 
“mixing of effects.” It occurs when the effect of exposure on the 
outcome under study is mixed with the effect of a third variable, 
called a confounder [12]. Chance can lead to imprecise results, which 
are an inevitable consequence of sampling, but its effects can be 
minimized by having a study that is sufficiently large. The role of 
chance is assessed by performing statistical tests and calculating 
confidence intervals [11]. It is only after careful consideration of 
these threats to validity that inferences about causal relationships 
can be made. In inherited diseases, multiple biases frequently influence 
the results of cohort studies, confounding may be present, 
and chance is more likely to play a role because of small sample sizes. 
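
The effect of sample size on the role of chance can be illustrated with a confidence interval for a proportion. The sketch below uses the simple normal-approximation (Wald) interval as an assumption made for brevity; it is not necessarily the method used in the studies cited, and it behaves poorly for very small samples or proportions near 0 or 1. The function name and the counts are hypothetical.

```python
import math


def wald_ci(events: int, n: int, z: float = 1.96) -> tuple:
    """Approximate 95 % confidence interval for a proportion."""
    p = events / n
    se = math.sqrt(p * (1.0 - p) / n)
    # Clip to the valid range for a proportion.
    return max(0.0, p - z * se), min(1.0, p + z * se)


# The same observed proportion (30 %) in a small and a large hypothetical cohort.
for events, n in ((6, 20), (600, 2000)):
    low, high = wald_ci(events, n)
    print(f"n={n}: {low:.3f} to {high:.3f}")
```

The point estimate is identical in both cohorts, but the interval from the small cohort is about ten times wider, which is the situation commonly faced in studies of rare inherited disorders.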

Evaluating the role of bias as an alternative explanation for an 
observed association, or lack of one, is vital when interpreting any 
study result. Therefore, a better understanding of the specific types 
of bias that exist and their implications for particular study designs is 
essential. The remainder of this chapter will be devoted to (a) 
describing various types of bias and how they relate to the study of 
inherited diseases, (b) discussing bias in relation to specific study 
designs, and (c) reporting general methods used to minimize bias. 


2 Types of Bias Common in Epidemiologic Studies 

It is impossible to eliminate all potential sources of error from a 
study, so it is important to assess their magnitude. There are two 
types of error associated with most forms of research: random and 
systematic. Random error is caused by variations in study samples 
arising by chance and systematic error refers to bias. Random error 
affects the precision of a study and may be minimized by using 
larger sample sizes. This may be impossible in rare genetic condi¬ 
tions. Systematic errors (bias) can affect the accuracy of a study’s 
findings and must be addressed by good study design [13]. This 
may also be difficult in inherited diseases particularly because of 
ascertainment bias, survivor bias, volunteer bias, lead-time bias, 
and so forth. Bias is not diminished by increasing sample size. 
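
A small simulation can make the last point concrete. The sketch below assumes a hypothetical population in which 20 % of mutation carriers have a severe phenotype, and severe cases are three times as likely as mild cases to be ascertained (for example, through a hospital clinic). All names and numbers are illustrative assumptions. As the sample grows, the estimate becomes more precise but settles on the wrong value.

```python
import random


def ascertained_severe_fraction(n_sampled: int,
                                true_severe: float = 0.2,
                                relative_ascertainment: float = 3.0,
                                seed: int = 0) -> float:
    """Fraction of severe cases in a sample that over-recruits severe cases."""
    rng = random.Random(seed)
    accept_mild = 1.0 / relative_ascertainment  # severe cases are always accepted
    sampled = 0
    severe_count = 0
    while sampled < n_sampled:
        is_severe = rng.random() < true_severe
        accept_probability = 1.0 if is_severe else accept_mild
        if rng.random() < accept_probability:
            sampled += 1
            severe_count += int(is_severe)
    return severe_count / n_sampled


for n in (100, 1_000, 50_000):
    print(n, round(ascertained_severe_fraction(n), 3))
```

Under these assumptions the expected fraction in the sample is 0.2 / (0.2 + 0.8 / 3), or about 0.43, rather than the true 0.20. Larger samples reduce the scatter around 0.43 but do not move the estimate toward 0.20.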

Each of the major parts of an investigation is at risk of bias, 
including selection of subjects, performance of the maneuver, mea¬ 
surement of the outcome, data collection, data analysis, data inter¬ 
pretation, and even reporting the findings [13]. Bias has been 
classified into three general categories: selection bias, information 
bias, and confounding [1, 14, 15]. Others include a fourth cate¬ 
gory of bias referred to as intervention bias [16-18]. Types of bias 
are listed in Table 1. 


Table 1 

Bias in clinical studies 

Selection bias 
    Ascertainment bias 
    Competing risks bias 
    Volunteer bias 
    Nonresponse bias 
    Loss to follow-up bias 
    Prevalence-incidence bias 
    Survivor treatment selection bias 
    Overmatching bias 

Information bias 
    Recall bias 
    Lead-time bias 
    Length-time bias 
    Diagnostic bias 
    Will Rogers phenomenon 
    Family information bias 

Intervention bias 
    Compliance bias 
    Proficiency bias 
    Contamination bias 


2.1 Selection Bias: Are the Groups Similar in All Important Respects? 

Selection bias is a distortion in the estimate of association between 
risk factor and disease that results from how the subjects are 
selected for the study [19, 20]. It occurs when there is a systematic 
difference between the characteristics of the people that are selected 
for the study and those that are not [21]. Selection biases will 
ultimately affect the applicability and usefulness of findings and 
make it impossible to generalize the results to all patients with the 
disorder of interest. Many types of biases occur in the study of 
inherited disorders, and the following are just a few of the more 
common biases that fall under the category of selection bias. 


2.1.1 Ascertainment Bias 

Ascertainment bias can occur in any study design. It is introduced 
by the criteria used to select individuals and occurs when the kind 
of patients selected for study are not representative of all cases in 
the population [16]. This is especially relevant to studies which 
examine risk associated with the inheritance of a mutated gene. 
Ascertainment bias is further complicated by the tendency of families 
with more severe disease to be identified through hospital clinics 
(often called referral bias), rather than through population-based 
research strategies [6]. 

Example 

In Lynch syndrome families with a mismatch repair gene mutation, 
the lifetime risk of colorectal cancer (CRC) has been determined using 
families that have been selected for genetic testing based on the 
Amsterdam criteria. These criteria require that CRC be present in 
three relatives, one of whom is a first degree relative of the other 
two, that at least two generations be affected, and that CRC occur 
in one of the family members before the age of 50 years. The use 
of these very restrictive criteria was helpful in the search for causative 
genes but was bound to cause an ascertainment bias towards 
multiple case families and towards a more severe phenotype. 
Smaller families with only one or two CRC cases and families with 
other Lynch syndrome-related cancers would not be selected, leading 
to an overrepresentation of families with multiple CRC cases in 
the study sample. Furthermore, families in which cancer occurred 
at a later age were excluded. The estimates of penetrance obtained 
in this manner would not be representative of all mutation carriers 
in the general population [5, 22]. 


2.1.2 Competing Risks Bias 

Competing risks occur commonly in medical research. Often, 
a patient may experience an event, other than the one of interest, 
which alters the risk of experiencing the actual event of interest. 
Such events are known as competing risk events [23] and may 
produce biased risk estimates. 
2.1.3 Volunteer Bias


2.1.4 Nonresponse Bias


2.1.5 Loss to Follow-Up Bias


Example 

Carriers of MSH2 (a mismatch repair gene) mutations are at risk of 
developing a number of different cancers, such as CRC, endome¬ 
trial cancer, uterine cancer, stomach cancer, transitional cell cancers 
of the kidney, ureter, bladder, and others. Therefore, estimates of 
risk obtained may be biased because multiple events are related to 
the genetic mutation. An individual who develops stomach cancer, 
for example, and who dies from it, will no longer be at risk for 
another type of cancer, such as CRC. Thus, when examining the 
incidence of CRC, stomach cancer would be a competing risk 
because those who die of it are no longer at risk of CRC. 

Volunteer bias is also referred to as “self-selection” bias. For ethical 
reasons, most studies allow patients to refuse participation. If those 
who volunteer for the study differ from those who refuse participation,
the results will be affected [9, 13]. Volunteers tend to be
better educated and healthier, to lead healthier lifestyles, and to have
fewer complications given similar interventions than the population as a
whole [14]. Research into genetic-environmental interactions has
shown the control group (those without the disease of interest) to 
have higher educational levels and higher annual income than the 
diseased group [24]. 

Example 

Those who volunteer to enter genetic screening programs may be 
healthier than those who refuse. This would lead to an incorrect 
assumption that the screening protocol favorably affects outcome. 
It may be that disease severity is responsible for the observed dif¬ 
ference, not the actual screening test. 

Nonresponse bias occurs when those who do not respond to take 
part in a study differ in important ways from those who respond 
[13, 25, 26]. This bias can work in either direction, leading to over¬ 
estimation or underestimation of the risk factor/intervention. 

Example 

Prevalence of disease is often estimated by a cross-sectional survey
or questionnaire. If, for example, you are trying to determine the
prevalence of disease associated with a genetic mutation, family
members may be contacted and sent a questionnaire to obtain the
necessary information. If those who return the questionnaire differ
from those who do not return it, then estimates of disease prevalence
will be biased. If those who failed to return the questionnaire
were sicker, for instance, the true prevalence of disease would be
underestimated.
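
The direction of this bias can be illustrated with a small calculation. The sketch below is purely hypothetical: the number of family members, the true prevalence, and the response rates are assumed values, chosen only to show how a lower response rate among affected relatives pulls the estimate below the truth.

```python
def observed_prevalence(n_affected, n_unaffected, resp_affected, resp_unaffected):
    """Prevalence calculated among questionnaire responders only."""
    responders_affected = n_affected * resp_affected
    responders_unaffected = n_unaffected * resp_unaffected
    return responders_affected / (responders_affected + responders_unaffected)

# Hypothetical family study: 1000 relatives contacted, 200 truly affected.
N_AFFECTED, N_UNAFFECTED = 200, 800
true_prevalence = N_AFFECTED / (N_AFFECTED + N_UNAFFECTED)

# Assumed response rates: affected (sicker) relatives respond less often.
estimate = observed_prevalence(N_AFFECTED, N_UNAFFECTED, 0.40, 0.80)

print(f"true prevalence:     {true_prevalence:.3f}")
print(f"observed prevalence: {estimate:.3f}")
```

If the two response rates were equal, the observed prevalence would match the true value; the bias arises only because response depends on disease status.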

Loss to follow-up bias can be seen in cohort studies and occurs 
when those who remain in the study differ from those “lost,” in 
terms of personal characteristics and outcome status [9, 14, 18, 21].
2.1.6 Prevalence-Incidence (Neyman) Bias


2.1.7 Survivor Treatment Selection Bias


2.1.8 Overmatching Bias


When the losses/withdrawals are uneven in both the exposure and 
outcome categories, the validity of the statistical results may be 
affected [16]. 

Example 

In order to determine the effectiveness of an intervention in reduc¬ 
ing disease risk, it may also be necessary to look at the potential 
side effects of the intervention to get an accurate picture. For 
example, if patients drop out of a study because of side effects of 
the intervention, then the results will be biased. Excluding these 
patients from the analysis will result in an overestimate of the effec¬ 
tiveness of the intervention. 

Selective survival may be important in some diseases. For these
diseases, the use of prevalent instead of incident cases usually distorts
the measure of effect [27] because a gap in time
occurs between exposure and selection of cases. A late look at
those exposed early will miss fatal, mild, or resolved cases [13, 15,
25, 26].

Example 

If cases for a particular disease are taken from hospital wards, they 
may not be representative of the general population. For example, 
if one wanted to look at the relationship between myocardial 
infarction (MI) and snow shoveling, hospitalized patients would 
not include patients with a mild, undetected MI or patients who
died at the scene or en route to hospital. Therefore, the relationship
between MI and snow shoveling would be underestimated. 

Survivor treatment selection bias occurs in observational studies 
when patients who live longer are more likely to receive a
certain treatment [16]. 

Example 

Arrhythmogenic right ventricular cardiomyopathy (ARVC) is a 
cause of sudden cardiac death, usually due to tachyarrhythmias. 
Patients with tachyarrhythmias are treated with antiarrhythmic 
drugs and implantable cardioverter-defibrillator (ICD) therapy 
[8]. However, in order for patients with ARVC to receive an ICD, 
they have to live long enough to have the surgery. Therefore, 
patients who receive an ICD may differ in disease severity from 
those who died before treatment, leading to an overestimation of 
the effect of the intervention. 

Overmatching bias occurs when cases and controls are matched on 
a nonconfounding variable (associated with the exposure but not 
the disease) and can underestimate an association [16]. 
2.2 Information Bias

Information bias is also referred to as “measurement bias.” Information
bias is a distortion in the estimate of association between
risk factor and disease that is due to systematic measurement error 
or misclassification of subjects on one or more variables, either risk 
factor or disease status [19]. It occurs if data used in the study are 
inaccurate or incomplete, thus influencing the validity of the study 
conclusions [20, 21]. The effect of information bias depends on 
its type and may result in misclassification of study subjects. 
Misclassification can be “differential” if it is related to exposure or 
outcome and differs in the groups to be compared, or “nondif¬ 
ferential” if it is unrelated to exposure and outcome and is the 
same across both groups to be compared [9, 16]. The following 
biases are all considered to be a particular type of information bias. 

2.2.1 Recall Bias

Recall or memory bias may be a problem if outcomes being measured
require that subjects (cases and controls) recall past events.
Questions about specific exposures may be asked more frequently 
of cases than controls, and cases may be more likely to intensely search 
their memories for potential causative exposures [13, 25-27]. The 
recall of cases and controls may differ both in amount and accuracy 
and the direction of differential recall cannot always be predicted 
[20, 26]. However, in most situations, cases tend to better recall 
past exposures leading to an overestimation of the association 
between outcome and prior exposure to the risk factor [19]. This 
is particularly important in family studies of genetic disease, where 
there is usually a cross-sectional start point and phenotypic infor¬ 
mation is obtained using retrospective and prospective designs. 
Retrospective chart reviews are unlikely to contain the detailed 
clinical information that can be obtained by prospective evaluation, 
thus requiring patients to recall past events to ensure complete 
information. 

Example 

Mothers of children with birth defects/abnormalities may recall 
exposure to drugs or other toxins more readily than mothers of 
healthy born children. This may lead to an overestimation of the 
association between a particular drug/toxin and birth defects. 

2.2.2 Lead-Time Bias

Lead-time bias is produced when diagnosis of a condition is made
during its latency period, leading to a longer duration of illness
[16] (see Fig. 1). If study patients are not all enrolled at similar, 
well-defined points in the course of their disease, differences in 
outcome over time may merely reflect this longer duration [28]. 
It falsely appears to prolong survival. 

Example 

This is particularly relevant in studies evaluating the efficacy of 
screening programs as cancer cases detected in the screened group 
Fig. 1 Natural course of a disease and possible biases related to the timing of diagnosis. The course of a
disease is represented as a sequence of stages, from biologic onset to a final outcome such as death. 
Disease diagnosis can be made as soon as pathologic lesions are detectable (stage #2); when initial signs 
and symptoms occur (stage #3); or later on (stage #4). Lead-time bias occurs when subjects are diagnosed 
earlier (a) than usual (b) independent of the speed of progression of the disease. If group A, for example, 
contains more subjects diagnosed in stage #2, the apparent observed benefit is due to a zero-time shift 
backward from the time of usual diagnosis leading to a longer observed duration of illness. Length-time bias 
occurs when more severe forms of the disease (c), characterized by shorter induction and/or latent periods 
and lower likelihood of early or usual diagnosis, are unbalanced by group. The apparent difference in prognosis
is due not only to differences in disease progression (slope) but also to differences in timing of diagnosis.
With permission from ref. 32 


will have a longer duration of illness than those diagnosed through 
routine care. For example, it may appear that cancer cases detected 
through screening have a 10-year survival as compared to a 7-year 
survival for those detected symptomatically. However, the appar¬ 
ent increase in survival may be due to the fact that the screening 
procedure was able to detect the cancer 3 years prior to the devel¬ 
opment of symptoms. Therefore, the overall survival time, from 
disease onset to death, is the same for both groups. 
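
The arithmetic of this example can be made explicit. In the sketch below, the 10-year and 7-year survival figures and the 3-year lead time come from the example above; the times of detection and death measured from biologic onset (2, 5, and 12 years) are assumed values chosen only to reproduce those figures.

```python
def survival_from_diagnosis(onset_to_death, onset_to_diagnosis):
    """Years of survival counted from the moment of diagnosis."""
    return onset_to_death - onset_to_diagnosis

# Assumed timeline, in years after biologic onset (identical disease course).
ONSET_TO_DEATH = 12
SCREEN_DETECTION = 2     # detected by screening, before symptoms
SYMPTOM_DETECTION = 5    # detected when symptoms develop

screened = survival_from_diagnosis(ONSET_TO_DEATH, SCREEN_DETECTION)
symptomatic = survival_from_diagnosis(ONSET_TO_DEATH, SYMPTOM_DETECTION)
lead_time = SYMPTOM_DETECTION - SCREEN_DETECTION

print(f"survival after screen detection:  {screened} years")
print(f"survival after symptom detection: {symptomatic} years")
print(f"lead time:                        {lead_time} years")
print(f"screened survival minus lead time: {screened - lead_time} years")
```

Subtracting the lead time from the survival of the screened group removes the apparent benefit entirely, because death occurs at the same point in the disease course in both groups.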
2.2.3 Length-Time Bias


2.2.4 Diagnostic Bias


2.2.5 Will Rogers Phenomenon


Length-time bias occurs when a group of patients contains more 
severe forms of the disease characterized by shorter induction 
periods or includes patients who have progressed further along 
the natural course of the disease and are in a later phase of disease 
(see Fig. 1). 

Example 

The apparent worse prognosis in one group of ADPKD patients
compared to another may be due not to faster progression of the
disease but to the fact that more cases with chronic kidney disease (who are
further on in the natural course of the disease) have been enrolled
in one group compared to the other.

Diagnostic bias is also referred to as surveillance bias and tends to 
inflate the measure of risk. This bias occurs when the disease being 
investigated is more likely to be detected in people who are under 
frequent medical surveillance as compared to those receiving rou¬ 
tine medical attention [25]. Screening studies are prone to this 
type of bias [9]. 

Example 

Carriers of mutations which predispose to cancer may undergo 
more intensive follow-up than those with undetected mutations 
allowing for earlier cancer detection among the former group. 

In medicine, the Will Rogers phenomenon refers to improvement 
over time in the classification of disease stages: if diagnostic sensi¬ 
tivity increases, metastases are recognized earlier so that the dis¬ 
tinction between early and late stages of cancer will improve [29]. 
This produces a stage migration from early to more advanced 
stages and an apparent higher survival [16]. 

Example 

This bias is relevant when comparing cancer survival rates across 
time or even among centers with different diagnostic capabilities. 
For example, Hospital A may have a more sensitive diagnostic test 
to detect cancer than Hospital B. Patients deemed to have early 
stage cancer at Hospital B would actually be deemed later stage 
cancer patients at Hospital A because of the sensitivity of the test at 
Hospital A to detect even earlier stage cancers. Therefore, Hospital 
A will appear to have better survival for its early stage cancer 
patients when compared to early stage cancer patients at Hospital B. 
However, the increased survival at Hospital A is due to a more 
sensitive measure which allows for better definition of an early 
stage cancer. 
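
A small numerical sketch may help to show why stage migration raises stage-specific survival without changing any patient's outcome. The survival times below are invented for illustration, and it is assumed that the more sensitive test reclassifies the two "early" patients with the shortest survival as late stage.

```python
from statistics import mean

# Hypothetical survival times in years; no individual outcome changes below.
early_old = [10, 9, 8, 4, 3]   # staged "early" by the less sensitive test
late_old = [2, 1, 1]           # staged "late" by the less sensitive test

# Assumption: the more sensitive test finds metastases in these two patients.
migrated = [4, 3]
early_new = [t for t in early_old if t not in migrated]
late_new = late_old + migrated

print("early stage mean survival:", mean(early_old), "->", mean(early_new))
print("late stage mean survival: ", round(mean(late_old), 2), "->", mean(late_new))
print("overall mean survival:    ", mean(early_old + late_old),
      "->", mean(early_new + late_new))
```

Mean survival rises in both the early and the late stage group, yet overall survival is identical, which is why comparisons should be interpreted with the staging method in mind.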
2.2.6 Family Information Bias


2.3 Intervention (Exposure) Bias


2.3.1 Compliance Bias


2.3.2 Proficiency Bias


2.3.3 Contamination Bias


Within a family, the flow of information about exposure and 
disease is stimulated by a family member who develops the disease 
[9, 13, 25, 26]. An affected individual is more likely than an unaf¬ 
fected family member to know about the family history for a 
particular disease. 

Example 

An individual with a particular disease is more likely to recall a positive 
family history of disease than a control subject who does not have 
the disease. Therefore, risk estimates for the effect of family history 
on disease may be overestimated when obtained from a case as 
opposed to a control. 

This group of biases involves differences in how the treatment or 
intervention was carried out, or how subjects were exposed to the 
factor of interest [17, 18]. Three common intervention biases are 
compliance bias, proficiency bias, and contamination bias. 

Compliance bias occurs when differences in subject adherence to 
the planned treatment regimen or intervention affect the study 
outcomes [13, 16, 26]. 

Example 

Patients who enter clinical screening programs following genetic 
testing may not always be compliant with guidelines established for 
appropriate screening intervals. Therefore, patients who do not 
follow the protocol guidelines will tend to have worse outcomes 
than compliant patients and this will lead to an apparent decrease 
in the effectiveness of screening. 

Proficiency bias occurs when treatments or interventions are not 
administered equally to subjects [13]. This may be due to skill or 
training differences among personnel and/or differences in 
resources or procedures used at different sites [17]. 

Example 

Colorectal cancer screening protocols may differ between facilities. 
For example, one hospital may use barium enema as the screening 
procedure whereas another may use colonoscopy. Colonoscopy is 
more efficient at detecting polyps than barium enema, leading to 
better outcomes. Therefore, comparing the impact of screening 
between these two hospitals, without taking into account the dif¬ 
ferent screening procedures, would lead to a biased result. 

Contamination bias occurs when control group subjects inadver¬ 
tently receive the intervention or are exposed to extraneous treat¬ 
ments, thus potentially minimizing the difference in outcomes 
between the cases and controls [13, 16, 26]. 
Example

To determine the effectiveness of ICD therapy in patients with 
ARVC as opposed to ARVC patients without an ICD, drug ther¬ 
apy should also be taken into consideration. If those in the control 
group are receiving antiarrhythmic drugs and those in the inter¬ 
vention group are not, then contamination bias may exist. This 
may lower survival benefit estimates for the ICD group. 

2.4 Confounding: Is an extraneous factor blurring the effect?

All associations are potentially influenced by the effects of confounding,
which can be thought of as alternative explanations for
an association. To be a confounder, a variable must meet the following
three characteristics:

1. It must be a risk factor for the disease in question. 

2. It must be associated with the exposure under study, in the 
population from which the cases were derived.

3. It must not be an intermediate step in the causal pathway 
between exposure and disease [30]. 

To accurately assess the impact of confounding, you must consider
the size and direction of its effect on the association. It is not
merely the presence or absence of a confounder that is the problem;
it is the influence of the confounder on the association that is
important [21]. Confounding can lead to either observation of
apparent differences between study groups when they don’t really 
exist (overestimation), or, conversely, observation of no differences 
when they do exist (underestimation) [21]. 

Age and sex are the most common confounding variables in 
health-related research. They are associated with a number of 
exposures, such as diet and smoking habits, and are also indepen¬ 
dent risk factors for most diseases [11]. Confounding cannot occur 
if potential confounders do not vary across groups [10]. For exam¬ 
ple, in a case-control study, for age and sex to be confounders, 
their representation should sufficiently differ between cases and 
controls [27]. 


3 Biases Linked to Specific Study Designs 

While some study designs are more prone to bias, its presence is 
universal. There is no ideal study design: different designs are 
appropriate in different situations and all have particular methodological
issues and constraints [1, 9].

Selection biases relate to the design phase of an observational 
study and are the most difficult to avoid [20]. They can be mini¬ 
mized in prospective cohort studies but are problematic in retro¬ 
spective and case-control studies, because both disease outcome 
and exposure have already been ascertained at the time of participant 
selection [9]. Bias in the choice of controls is also a major issue in
case-control studies. Ensuring that controls are a representative
sample of the population from which the cases came is difficult and
can be time consuming [14]. 

Loss to follow-up bias is a major problem in prospective cohort 
studies especially if people drop out of a study for reasons that are 
related to the exposure or outcome [31]. If the duration of follow-up
is shorter than the time required for a particular exposure to
have its effect on an outcome, then risks will be underestimated
[14, 21].

Recall bias is a major issue for case-control and retrospective
cohort studies where exposure status may be self-reported [9]. 
Subjects who know their disease status are more likely to accurately 
recall prior exposures. This bias is particularly problematic when 
the exposure of interest is rare [26]. Prevalence-incidence bias is 
also problematic for case-control studies, especially if cases are 
identified in the clinical setting because mild cases who do not 
present to clinic or those who have died at a young age are likely to 
be missed [25]. 

Cross-sectional studies are particularly vulnerable to volunteer 
bias and nonresponse bias. This type of study requires participants 
to fill out a questionnaire/survey and is likely to be biased as those 
who volunteer are unlikely to be representative of the general pop¬ 
ulation and those who respond likely differ from those who do not 
respond. 


4 Methods to Minimize the Impact of Bias and Confounding in Epidemiologic Studies


Issues of bias and confounding are challenges that researchers 
face, especially in observational studies. Techniques exist to pre¬ 
vent and adjust for these biases analytically, and an understanding 
of their characteristics is vital to interpreting and applying study 
results [12]. 

4.1 Strategies for Dealing with Bias

The causes of bias can be related to the manner in which study
subjects are chosen, the method in which study variables are collected
or measured, the attitudes or preferences of an investigator,
and the lack of control of confounding variables [9]. The key to 
decreasing bias is to identify the possible areas that could be 
affected and to change the design accordingly [18, 21]. 

Minimizing selection biases requires careful selection of a study 
population for whom complete and accurate information is avail¬ 
able [21]. To ensure this, clear and precise definitions should be 
developed for populations, disease, exposure, cases and controls, 
inclusion and exclusion criteria, methods of recruiting the subjects 
into the study, units of measurement, and so forth. Comparison of 
baseline characteristics is important to ensure the similarity of the 




4.2 Strategies for Dealing with Confounding


4.2.1 Prevention in the Design Phase


groups to be compared. Paying careful attention to the willingness 
of subjects to continue with the study and employing multiple 
methods of follow-up can be useful in reducing loss [30]. Some 
loss is almost inevitable, depending on the length of the study, and 
as such it can be useful to perform sensitivity analyses. Here, all
missing subjects are assumed first to have had a good outcome and
then a bad outcome, and the impact of each assumption on the
results is evaluated [18].
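
Such a sensitivity analysis can be sketched in a few lines. The example below is a minimal illustration under assumed numbers (a hypothetical two-arm study with 100 subjects per arm); the function name and all counts are invented and do not come from any cited study.

```python
def risk_ratio(events_treated, n_treated, events_control, n_control):
    """Risk of a bad outcome in the treated arm divided by that in controls."""
    return (events_treated / n_treated) / (events_control / n_control)

# Hypothetical study: 100 subjects enrolled per arm; "event" = bad outcome.
ENROLLED = 100
treated_followed, treated_events = 80, 8     # 20 treated subjects lost
control_followed, control_events = 90, 18    # 10 control subjects lost
treated_lost = ENROLLED - treated_followed
control_lost = ENROLLED - control_followed

# Complete-case analysis: subjects lost to follow-up are simply dropped.
rr_complete = risk_ratio(treated_events, treated_followed,
                         control_events, control_followed)

# Most favorable to treatment: lost treated subjects all had a good outcome,
# lost controls all had a bad outcome.
rr_favorable = risk_ratio(treated_events, ENROLLED,
                          control_events + control_lost, ENROLLED)

# Least favorable to treatment: the reverse assumption.
rr_unfavorable = risk_ratio(treated_events + treated_lost, ENROLLED,
                            control_events, ENROLLED)

print(f"complete cases: {rr_complete:.2f}")
print(f"favorable:      {rr_favorable:.2f}")
print(f"unfavorable:    {rr_unfavorable:.2f}")
```

With these assumed counts the risk ratio ranges from about 0.29 to about 1.56 around a complete-case value of 0.50, so the apparent benefit would not be robust to the losses.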

Minimizing information biases requires appropriate and objec¬ 
tive methods of data collection and good quality control. 
Information bias can be reduced by the use of repeated measures, 
training of the researchers, using standardized measures, and using 
more than one source of information [18]. Blinding of subjects, 
researchers, and statisticians can also reduce those biases where 
knowledge of a subject’s treatment, exposure, or case-control sta¬ 
tus may influence the results obtained [14, 20]. 

Confounding is the only type of bias that can be prevented or 
adjusted for, provided that confounding was anticipated and the 
requisite information was collected [15]. In observational studies, 
there are two principal ways to reduce confounding:

1. Prevention in the design phase by restriction or matching 

2. Adjustment in the analysis phase by either stratification or mul¬ 
tivariate modeling [2, 10-12, 15, 21, 27] 

At the time of designing the study, one should first list all those 
factors which are likely to be confounders, and then decide how to 
deal with each in the design and/or analysis phase [30]. 

The simplest approach is restriction. Restriction occurs when admis¬ 
sion to the study is restricted to a certain category of a confounder 
[2, 21, 27]. For example, if smoking status is a potential confounder,
the study population may be restricted to nonsmokers.
Although this tactic avoids confounding, it leads to poorer external
validity, as results are limited to the narrow group included in the
study, and it may also shrink the sample size [12, 15].

A second strategy is matching, so that for each case one or 
more controls with the same value for the confounding variable are 
selected. This allows all levels of a potential confounder to be 
included in the analysis and ensures that within each case-control 
stratum, the level of the confounder is identical [2, 21, 27]. 
A common example is to match cases and controls for age and 
gender. Matching is most commonly used in case-control studies, 
but it can be used in cohort studies as well and is very useful when 
the number of subjects in a study is small. However, when the 
number of possible confounders increases, matching cases and 
controls can be difficult, time consuming, and therefore expensive. 
A number of potential controls may have to be excluded before 
one is found with all the appropriate characteristics. Also, by definition, 
one cannot examine the effect of a matched variable [15, 21, 27]. 
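
The mechanics of matching can be sketched as follows. This is a simplified illustration of 1:1 exact matching on age band and sex; the function name, field names, and subjects are all hypothetical, and real studies often use more flexible schemes (for example, matching within an age range or selecting several controls per case).

```python
from collections import defaultdict

def match_controls(cases, controls, keys=("age_band", "sex")):
    """Pair each case with one unused control sharing the matching variables."""
    pool = defaultdict(list)
    for control in controls:
        pool[tuple(control[k] for k in keys)].append(control)

    pairs, unmatched = [], []
    for case in cases:
        candidates = pool[tuple(case[k] for k in keys)]
        if candidates:
            pairs.append((case, candidates.pop(0)))  # each control used once
        else:
            unmatched.append(case)                   # no suitable control
    return pairs, unmatched

cases = [
    {"id": "case1", "age_band": "50-59", "sex": "F"},
    {"id": "case2", "age_band": "60-69", "sex": "M"},
    {"id": "case3", "age_band": "70-79", "sex": "F"},
]
controls = [
    {"id": "ctl1", "age_band": "60-69", "sex": "M"},
    {"id": "ctl2", "age_band": "50-59", "sex": "F"},
    {"id": "ctl3", "age_band": "50-59", "sex": "M"},
]

pairs, unmatched = match_controls(cases, controls)
for case, control in pairs:
    print(case["id"], "matched to", control["id"])
print("unmatched cases:", [c["id"] for c in unmatched])
```

The unmatched case and the unused control illustrate the cost noted above: the more variables are added to the matching key, the more subjects are likely to be discarded.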
4.2.2 Adjustment in the Analysis Phase


5 Conclusions 


Adjusting for confounding will improve the validity but reduce the 
precision of the estimate [20]. It is only possible to control for 
confounding in the analysis phase if data on potential confounders 
were collected during the study. Therefore, the extent to which 
confounding can be controlled for will depend on the completeness
and accuracy of these data [11].

Stratification is one option commonly used to adjust for con¬ 
founding after the study has been completed. It is a technique that 
involves the evaluation of association between the exposure and 
disease within homogeneous categories (strata) of the confound¬ 
ing variable [10, 15, 21]. For example, if gender is a confounding 
variable, the association between exposure and disease can be 
determined for men and women separately. Methods are then 
available for summarizing the overall association, by producing a 
weighted average of the estimates obtained for each stratum. One 
such method is the Mantel-Haenszel procedure [21]. If the 
Mantel-Haenszel adjusted effect differs substantially from the 
crude effect, then confounding is deemed present [15]. 
Stratification is often limited by the size of the study and by its ability
to control for only a small number of factors simultaneously. As the
number of confounders increases, the number of strata greatly
increases and the sample size within each stratum decreases. This, in
turn, may lead to inadequate statistical power [2, 21].
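
The comparison of crude and Mantel-Haenszel adjusted estimates can be illustrated with a short calculation. The sketch below uses the standard Mantel-Haenszel summary odds ratio for a set of 2 x 2 tables; the counts are hypothetical and were chosen so that gender is associated with both exposure and disease while the exposure has no effect within either stratum.

```python
def odds_ratio(a, b, c, d):
    """a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Summary odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    strata = list(strata)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical case-control data stratified by gender: (a, b, c, d).
strata = {"men": (40, 40, 10, 10), "women": (5, 20, 15, 60)}

# The crude table ignores gender by adding the strata cell by cell.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude = odds_ratio(*crude_table)
adjusted = mantel_haenszel_or(strata.values())

for name, table in strata.items():
    print(f"{name}: OR = {odds_ratio(*table):.2f}")
print(f"crude OR:           {crude:.2f}")
print(f"Mantel-Haenszel OR: {adjusted:.2f}")
```

Here the crude odds ratio is 2.1 while the stratum-specific and adjusted odds ratios are 1.0, so the crude association would be attributed entirely to confounding by gender.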

The second option is multivariate modeling. This approach 
uses advanced statistical methods of analysis that simultaneously 
adjust (control) for several variables while examining the potential 
effect of each one [2, 10, 15, 21]. In epidemiological research, the 
most common method of multivariate analysis is regression modeling,
which includes linear regression, logistic regression, and Cox’s
regression, to name a few. Linear regression is used if the outcome
variable is continuous, logistic regression is used if the outcome variable
is binary, and Cox’s regression is used when the outcome is
the time to an event. Multivariate modeling can control for more factors
than stratification but care should still be taken not to include
too many variables. Also, the appropriateness and the fit of the 
model should be examined to ensure accurate conclusions [21]. 

Each method of control has its strengths and limitations and in 
most situations, a combination of strategies will provide the best 
solution [21]. 


In epidemiologic research, it is essential to avoid bias and to con¬ 
trol for confounding. However, bias of some degree will always be 
present in an epidemiologic study. The main concern, therefore, is 
how it relates to the validity of the study. Selection biases make it 
impossible to generalize the results to all patients with the disorder 
Table 2

Questions to aid in the identification of biases [14] 


• Is the study population defined? 

• Does the study population represent the target population? 

• Are the definitions of disease and exposure clear? 

• Is the case definition precise? 

• What are the inclusion or exclusion criteria? 

• Do the controls represent the population from which the cases came? 

• Could exposure status have influenced identification or selection of 
cases or controls? 

• Are the cohorts similar except for exposure status? 

• Are the measurements as objective as possible? 

• Is the study blinded as far as possible? 

• Is the follow-up adequate? 

• Is the follow-up equal for all cohorts? 

• Is the analysis appropriate? 

• Are the variable groups used in the analysis defined a priori? 

• Is the interpretation supported by the results? 


of interest, while the measurement biases influence the validity of 
the study conclusions [20]. Key questions to ask to identify biases 
when planning, executing, or reading a study are identified in 
Table 2 [14]. If answers to these questions are unsatisfactory, then 
careful consideration should be given to the quality and clinical 
relevance of the study’s results. 

The previous discussion is just an overview of some of the main 
types of biases inherent in observational research. For a more 
detailed and comprehensive list, refer to articles by Sackett [26], 
Delgado-Rodríguez and Llorca [16], and Hartman et al. [13].
Chapter 21


Clinical Genetic Research 2: Genetic Epidemiology 
of Complex Phenotypes 

Darren D. O’Rielly and Proton Rahman 

Abstract 

Genetic factors play a substantive role in the susceptibility to common diseases. Due to recent and rapid 
advancements in characterization of genetic variants and large-scale genotyping platforms, multiple genes 
and genetic variants have now been identified for common, complex diseases. The most efficient method 
for gene identification at present appears to be large-scale association-based studies, which integrate 
genetic and epidemiological principles. As the strategy for gene identification studies has shifted towards 
genetic association-based methods rather than traditional linkage analysis, epidemiological methods are 
increasingly being integrated into genetic investigations. Consequently, the disciplines of genetics and 
epidemiology, which historically have functioned separately, have been integrated into a discipline referred 
to as genetic epidemiology. In this chapter, we review methods for establishing the genetic burden of 
complex genetic disease, followed by methods for gene and/or genetic variant identification and when 
appropriate we highlight the epidemiological issues that guide these methods. 

Key words Genetic epidemiology, Linkage studies, Association studies, Whole genome-wide association 
scans, Genotyping, Linkage disequilibrium 


1 Introduction 


Despite many advances in medicine, the basis for most common, 
complex disorders still remains elusive [1]. Complex disorders, 
assumed to be of multifactorial etiology, exhibit non-Mendelian 
inheritance and arise from an interplay of genetic, epigenetic, and 
environmental factors. Cumulative evidence implicates a substan¬ 
tive role for genetic factors in the etiology of common diseases [1]. 
Recent evidence suggests that common genetic variants will explain 
at least some of the inherited variation in susceptibility to common 
disease [2]. These variants may impact disease susceptibility, 
expression, or drug response. For most traits, there is evidence that 
rare Mendelian mutations, low-frequency segregating variants, 
and copy number variants (CNVs) may also contribute toward 
complex phenotypes. The elucidation of the genetic determinants 
is of great importance as it can suggest mechanisms that can lead
to a better understanding of disease pathogenesis, and improved 
prediction of disease risk, diagnosis, prognosis, and therapy. 

The contribution of genetics to complex disease is usually 
ascertained by conducting population-based genetic-epidemiolog¬ 
ical studies followed by carefully planned genetic analyses [3]. 
Specifically, these include family- and population-based epidemio¬ 
logical studies that estimate the genetic burden of disease followed 
by molecular studies that investigate candidate gene, genome-wide 
linkage, or genome-wide association studies, which set out to iden¬ 
tify the specific genetic determinants. 

As the strategy for gene identification studies has shifted towards 
genetic association methods rather than traditional linkage analysis, 
epidemiological methods are increasingly being integrated into 
genetic concepts [4]. As a result, the disciplines of genetics and epi¬ 
demiology, which historically have functioned separately, have been 
integrated into a discipline referred to as genetic epidemiology. In 
this chapter, we review methods for establishing the genetic burden 
of complex genetic disease, followed by methods for gene and/or 
genetic variant identification. We end with a discussion on genome- 
wide association analysis, the so-called “missing heritability” of 
complex disease, and finally we discuss strategies to elucidate this 
“missing heritability” in future genomic investigations. 


2 Establishing a Genetic Component of a Complex Trait 

For complex traits, it is necessary to prove claims of genetic 
determination. The obvious way to approach this is to show that 
the trait segregates in families. Segregation analysis is the main 
statistical tool for analyzing the inheritance of any trait, and it can 
provide evidence for or against a major susceptibility locus. 

Complex genetic diseases are considered to be heritable as they 
clearly cluster in families. However, these traits do not demonstrate 
a clear Mendelian pattern of inheritance (i.e., dominant or reces¬ 
sive). The strength of the genetic effect in any particular disease 
varies significantly and is not necessarily intuitive, as familial clus¬ 
tering, from a clinical perspective, occurs more frequently in highly 
prevalent diseases. If the contribution of the genetic variation is 
small, then a gene identification project becomes very difficult if 
not impossible, as large sample sizes will be required in order to 
identify genetic factors. As a result, it is important to carefully 
examine all epidemiological data to determine the genetic burden 
of a complex trait or disease. 

One of the most compelling methods to implicate genetics in 
complex disease is through the use of twin studies [5]. Twin studies 
typically estimate the heritability of a disorder, which refers to 
the proportion of variability of a trait attributed to genetic factors. 
An increased disease concordance between monozygotic twins as 
compared to dizygotic twins strongly implicates genetic factors. 
This is indeed the case in psoriasis, where there is a threefold 
increased risk of psoriasis in identical twins as compared with 
fraternal twins [6, 7]. However, as the concordance for psoriasis 
among monozygotic twins is 35 %, this suggests that environmental 
factors also play an important role in disease susceptibility. 
Importantly, genetic epidemiologists must be mindful that parents 
give their children their environment as well as their genes and 
consequently, many traits segregate in families because of shared 
family environment. 

Evidence for genetic variation of a complex trait can also be 
obtained from population-based studies by estimating the risk 
ratio. The magnitude of genetic contribution is estimated by assessing 
the relative proportion of disease in siblings (or other relatives) 
as compared with the prevalence of disease in the general population. 
This parameter, originally formulated by Risch, is denoted as 
λ_R, where “R” represents the degree of relatedness, with higher λ 
values indicating a greater genetic effect [8]. By convention, any λ 
value greater than two is generally considered to indicate a significant 
genetic component. It is important to acknowledge however 
that λ_R can be influenced by shared environmental factors and thus 
cannot be solely attributed to genetic factors [8]. Moreover, the 
prevalence of a particular trait in relatives and the general population 
can affect its size. Therefore, a strong genetic effect in a very 
common disease will yield a smaller λ value, compared with an 
equally strong genetic effect in a rare disease. Consequently, the λ 
value is not an ideal measure for highly prevalent complex diseases 
as it can underestimate the genetic component [8]. 
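
The dependence of the risk ratio on prevalence can be illustrated with a minimal sketch. The function names and all numerical values below are hypothetical and chosen only for illustration; the sketch assumes the ratio is simply the risk in relatives divided by the population prevalence, as described above. Because a risk cannot exceed 1, the ratio can never exceed the reciprocal of the prevalence, which is why a common disease cannot produce a large value.

```python
def recurrence_risk_ratio(risk_in_relatives: float, population_prevalence: float) -> float:
    """Risk of disease in relatives of type R divided by the population prevalence."""
    if not 0 < population_prevalence <= 1:
        raise ValueError("population prevalence must lie in (0, 1]")
    if not 0 <= risk_in_relatives <= 1:
        raise ValueError("risk in relatives must lie in [0, 1]")
    return risk_in_relatives / population_prevalence


def max_possible_ratio(population_prevalence: float) -> float:
    """Upper bound on the ratio: the risk in relatives cannot exceed 1."""
    return 1.0 / population_prevalence


# Hypothetical values, not taken from any study.
rare_disease = recurrence_risk_ratio(risk_in_relatives=0.05, population_prevalence=0.001)
common_disease = recurrence_risk_ratio(risk_in_relatives=0.30, population_prevalence=0.10)

print(f"rare disease:   ratio = {rare_disease:.1f}")
print(f"common disease: ratio = {common_disease:.1f} "
      f"(cannot exceed {max_possible_ratio(0.10):.0f})")
```

In this illustration, siblings of patients with the common disease carry a far higher absolute risk, yet the ratio is much smaller than for the rare disease.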


3 Determining the Mode of Inheritance 

Most biological traits of interest to humans have a multifactorial 
pattern of inheritance in that they are determined by many muta¬ 
tions at multiple loci, as well as by many nongenetic factors [8]. 
Some traits demonstrate classical Mendelian patterns of inheritance 
and segregate within families [9-12]. For example, there are 
reports that propose an autosomal dominant inheritance pattern and 
others that suggest an autosomal recessive pattern of inheritance 
for psoriasis [10]. 

Formal segregation analysis has historically been used as the 
method for identifying the presence of a major genetic effect. 
However, due to the expense and time required, segregation analysis 
is now routinely overlooked. Risch has developed criteria for 
using risk ratios among relatives of differing relatedness to obtain 
information about genetic models [8]. When the risk ratio (λ_R - 1) 
decreases by a factor of greater than 2 between the first and second 
degrees of relatedness, the data are consistent with a multi-locus 
model [8]. 
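
The criterion above amounts to a simple comparison, sketched below. The function names and the example risk ratios are hypothetical; the only rule encoded is the one stated in the text, namely that a drop in (λ_R - 1) by a factor greater than 2 between first- and second-degree relatives is consistent with a multi-locus model.

```python
def excess_risk_drop(lambda_first_degree: float, lambda_second_degree: float) -> float:
    """Factor by which (lambda_R - 1) falls from first- to second-degree relatives."""
    if lambda_second_degree <= 1.0:
        raise ValueError("second-degree risk ratio must exceed 1 for the comparison")
    return (lambda_first_degree - 1.0) / (lambda_second_degree - 1.0)


def consistent_with_multilocus(lambda_first_degree: float, lambda_second_degree: float) -> bool:
    """True when the excess risk falls by a factor greater than 2."""
    return excess_risk_drop(lambda_first_degree, lambda_second_degree) > 2.0


# Hypothetical risk ratios for illustration only.
print(excess_risk_drop(10.0, 2.5), consistent_with_multilocus(10.0, 2.5))
print(excess_risk_drop(3.0, 2.0), consistent_with_multilocus(3.0, 2.0))
```

Sampling error in the estimated risk ratios is ignored here, so the comparison is a screening aid rather than a formal test.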

Increasingly, alternative non-Mendelian inheritance patterns 
are now being proposed for complex disorders. These include trip¬ 
let expansion mutations (anticipation), genomic imprinting, and 
mitochondrial-related inheritance [11]. Anticipation is character¬ 
ized by a dynamic trinucleotide repeat sequence mutation and is 
associated with an increase in severity and decrease in age of onset 
in successive generations [11]. Myotonic dystrophy is a classic 
example of a trait that demonstrates anticipation [12]. 

Genomic imprinting refers to an epigenetic effect that causes 
differential expression of a gene depending on the sex of the transmitting 
parent [13]. The imprinting process dictates that the gene 
can be transmitted from a parent of either sex but is expressed only 
when inherited from a parent of one particular sex. Imprinting is a normal 
developmental process that regulates gene expression and is thought 
to affect a limited number of genes [13]. The first human disorder 
recognized to be a consequence of genomic imprinting is the 
Prader-Willi syndrome [14]. This phenomenon has also been 
reported in autism and multiple autoimmune diseases including 
psoriatic arthritis, where the proportion of probands with an 
affected father (0.65) is significantly greater than the expected 
proportion of 0.5 [15-17]. 
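
Whether an observed proportion of affected fathers departs from the expected 0.5 can be checked with an exact binomial test, sketched below. The sample size of 100 probands is a hypothetical assumption, since the text reports only the proportion (0.65); the function name is likewise illustrative.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact one-sided probability of observing at least `successes` under p_null."""
    return sum(
        comb(trials, k) * p_null**k * (1.0 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical sample: 65 of 100 probands with one affected parent have an affected father.
p_value = binomial_upper_tail(successes=65, trials=100)
print(f"one-sided p value = {p_value:.4f}")
```

With a smaller hypothetical sample the same proportion would give much weaker evidence, so the number of probands matters as much as the proportion itself.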

Mitochondria contain their own genome and mitochondrial 
DNA controls a number of protein components of the respiratory 
chain and oxidative phosphorylation system. Since mitochondria 
are transmitted maternally, a pattern of matrilineal transmission of 
a disorder is suggestive of mitochondrial inheritance. The stron¬ 
gest evidence for a mitochondrial DNA mutation in a genetic dis¬ 
ease is in Leber’s optic atrophy, which is characterized by late-onset 
bilateral loss of central vision and cardiac dysrhythmias [18]. 
Mitochondrial DNA (mtDNA) mutations have been linked to 
several complex diseases including Alzheimer’s disease and 
Parkinson’s disease [19-22]. 


4 Strategies for Gene Identification 

Once the genetic burden of a trait has been confirmed, attention is 
then directed to identifying the specific genetic factors underpinning 
disease pathogenesis. Two commonly used strategies for genetic 
identification are linkage and association analyses. Linkage methods 
are based on a positional cloning strategy, where the disease gene is 
isolated by its chromosomal location without any knowledge of the 
function of the gene. Association analysis, on the other hand, assesses 
the relationship between a particular allele, genotype or haplotype 
in a gene and a given trait or disease. The underlying assumption 
in association studies is that the causative allele is relatively more 
common among individuals with a given trait as compared to a 
healthy control population. While both methods are meritorious in 
gene identification, the benefits and limitations of each must be 
carefully considered when selecting a method for gene identification. 


4.1 Linkage Studies 

4.1.1 Design of Linkage Studies 

Linkage methods were initially used for identification of susceptibility 
determinants across the entire genome. Positional cloning 
has been very successful for identifying disease-related genes for 
multiple single gene disorders and select complex disorders. There 
are three basic steps involved in positional cloning. The first is 
localization of the relevant genes to candidate regions where susceptibility 
loci are suspected to reside. The next step involves isolation 
of the candidate genes. The third step involves demonstration 
of gene mutations within the suspected disease genes, thus proving 
that the candidate gene is indeed the true susceptibility gene. 
Genetic epidemiologists are intimately involved with the first two 
steps, whereas the third step, functional characterization of a gene, 
is usually carried out by a molecular biologist. 

The immediate appeal of linkage studies is the ability to identify 
novel genes which may not have been initially considered as 
potential targets. The initial step of positional cloning requires collection 
of families with multiple affected individuals so that linkage 
analysis can be performed. 

There are two established approaches for linkage analysis: a 
recombinant-based method (also referred to as the traditional or 
parametric method) and an allele sharing method (also referred to 
as the model independent or nonparametric method) [1, 3]. 

Parametric Linkage Analysis 

The recombinant-based method has been very successful in 
elucidating the genetic basis of Mendelian disorders where the 
mode of transmission is known. However, application of this 
method to determine susceptibility loci for complex traits has 
proven much more difficult. This method is based on the biological 
phenomenon of crossing over, termed “recombination.” 
Crossovers take place randomly along a chromosome and the 
closer two loci are to one another, the less likely a crossover event 
will occur. The recombination fraction is denoted as θ for a given 
pedigree, and is defined as the probability that a gamete is recombinant. 
The recombination fraction is a function of distance, when 
certain assumptions (e.g., mapping functions) are considered. 
Unfortunately, given that the number of recombination events 
cannot be directly counted, the recombination fraction must be 
estimated. This is performed by the maximum likelihood method 
based on a likelihood ratio statistic, which requires extensive computations 
and is performed using specialized statistical packages. 
Estimates of the recombination fraction, along with the linkage 
statistic are central parameters in the traditional linkage method. 

The ideal setting for the traditional linkage approach involves 
ascertaining multigenerational families with multiple affected 
members. All family members are genotyped to determine the 
number of recombinants. Subsequently, various assumptions are 
made regarding the mode of inheritance of the disease, frequency 
of disease alleles, disease penetrance as well as other prespecified 
parameters. Based on these assumptions and the pedigree structure 
of the family, a likelihood function for the observed data is constructed. 
This likelihood is a function of the recombination fraction. 
Using this likelihood function, a test of the hypothesis of no linkage 
between a marker and disease can be performed. This is based on the 
likelihood ratio statistic: 

$$H_0: \theta = 0.5 \ \text{(no linkage)} \qquad H_a: \theta < 0.5$$

$$Z(\theta) = \log_{10}\left[\frac{L(\theta)}{L(\theta = 0.5)}\right]$$

where L(θ) is the likelihood function. The null hypothesis of no linkage 
is rejected if the value of Z maximized over θ in (0, 0.5) is greater 
than 3.0. 
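
A minimal sketch of this statistic is given below for the simplest possible setting, which is an assumption made for illustration and not the general pedigree likelihood: every meiosis is fully informative and phase-known, so that recombinants can be counted directly and L(θ) is a binomial likelihood. The function names and the counts are hypothetical; real analyses rely on the specialized packages mentioned above.

```python
from math import log10


def lod_score(theta: float, recombinants: int, meioses: int) -> float:
    """Z(theta) = log10[L(theta) / L(0.5)] for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    non_recombinants = meioses - recombinants
    likelihood = theta**recombinants * (1.0 - theta) ** non_recombinants
    null_likelihood = 0.5**meioses
    if likelihood == 0.0:
        return float("-inf")  # theta = 0 is impossible once a recombinant is seen
    return log10(likelihood / null_likelihood)


def max_lod(recombinants: int, meioses: int, steps: int = 500):
    """Grid search for the theta in [0, 0.5] that maximizes the LOD score."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    best_theta = max(grid, key=lambda t: lod_score(t, recombinants, meioses))
    return best_theta, lod_score(best_theta, recombinants, meioses)


# Hypothetical counts: 1 recombinant among 10 informative meioses.
theta_hat, z_max = max_lod(recombinants=1, meioses=10)
print(f"theta_hat = {theta_hat:.3f}, maximum LOD = {z_max:.2f}")
```

Under these assumptions, ten meioses with no recombinants give a maximum LOD of about 3.0 at θ = 0, which is the smallest such family able to reach the classical threshold on its own.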

Numerous problems occur when using this method for complex 
diseases, as the following specifications need to be estimated 
for the traditional method: disease gene frequency, mode of transmission, 
penetrance of disease genes, phenocopy rate, and marker 
allele frequency [9]. Estimation of the recombination fraction is 
also sensitive to pedigree structure and certainty of the diagnosis. 
Model misspecification can result in a biased estimate of the recombination 
fraction, with detrimental effects on the 
power to detect linkage. 

Nonparametric Linkage Analysis 

An alternative approach to linkage studies for complex traits is the 
allele sharing or nonparametric method [23]. This refers to a set of 
methods which are based on the following premise: in the presence 
of linkage between a marker and disease, sets of relatives who share 
the same disease status will be more similar at the marker locus 
than one would expect if the two loci were segregating independently. 
The similarity at the marker locus is measured by counting 
the number of alleles shared identical by descent (IBD) by two 
relatives. Alleles are considered IBD if they are copies of the 
same ancestral allele. For example, for a pair of siblings from a mating of 
heterozygous parents, the expected frequencies of sharing 0, 1, and 2 
alleles IBD are 25, 50, and 25 %, respectively. The closer the marker 
locus is to the disease locus, the more likely these proportions will 
be skewed. 

Linkage is stated to occur if there is a significant distortion of 
these proportions. The recombinant-based method is always more 
powerful than the allele sharing method to identify susceptibility 
regions when the correct model is specified. However, these models 
are often misspecified. In the allele sharing method, penetrance is 
not a confounder as it only includes siblings that share the same 
status. However, this method may lack power as compared with 
the traditional method. In order to overcome this limitation, a 
larger number of affected families are required in the allele sharing 
method as compared with the traditional method. In other cases, 
there is little to distinguish between the recombinant-based and 
allele sharing method, even though the allele sharing method is 
currently favored among statistical geneticists for linkage-based 
studies. 
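
The idea of testing for distortion of IBD sharing can be sketched as a goodness-of-fit test in affected sibling pairs. This is one simple illustration and not the specific statistic used by any particular package: it assumes IBD status is known without ambiguity for every pair, compares the observed counts with the 25/50/25 % expectation, and uses a chi-square reference distribution with 2 degrees of freedom. The counts are hypothetical.

```python
from math import exp

# Expected proportions of sibling pairs sharing 0, 1, and 2 alleles IBD under no linkage.
EXPECTED_IBD = (0.25, 0.50, 0.25)


def ibd_sharing_test(counts):
    """Chi-square goodness-of-fit test (2 df) of IBD counts against 25/50/25 %."""
    n_pairs = sum(counts)
    if n_pairs == 0:
        raise ValueError("at least one sibling pair is required")
    statistic = sum(
        (observed - n_pairs * p) ** 2 / (n_pairs * p)
        for observed, p in zip(counts, EXPECTED_IBD)
    )
    p_value = exp(-statistic / 2.0)  # exact chi-square upper tail for 2 df
    return statistic, p_value


# Hypothetical data: 100 affected sibling pairs sharing 0, 1, and 2 alleles IBD.
ibd_stat, ibd_p = ibd_sharing_test((10, 50, 40))
print(f"chi-square = {ibd_stat:.1f}, p value = {ibd_p:.2e}")
```

Excess sharing of two alleles and a deficit of pairs sharing none, as in this example, is the pattern expected near a susceptibility locus.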

A successful genome-wide scan will result in the identification 
of candidate region(s) where the susceptibility gene(s) is suspected 
to reside. Genome-wide linkage studies initially used microsatellite 
markers with an average spacing of about 10 cM. These have largely 
been superseded by single nucleotide polymorphisms (SNPs). 
The advantage of microsatellite markers is that they are more polymorphic 
than diallelic SNPs and consequently are more informative. 
However, given that there are many more SNPs than 
microsatellites, even with just 10,000 SNPs covering the genome 
(an approximate marker density of 0.3 cM), the information content 
for SNP genotyping is greater than microsatellite mapping. 
Currently, microarrays consisting of over a million SNPs, essentially 
blanketing the genome, are commercially available for linkage 
studies within families. 

In recombinant-based linkage analysis, standard practice is to summarize 
the results of a linkage analysis in the form of a LOD score 
function (log to the base 10 of the likelihood ratio). The results 
from the allele-sharing method may be summarized using various 
methods (e.g., p value, chi-square, Z value), but can be transformed 
into a LOD score as well. The threshold for the traditional recombinant-based 
method was set by Morton in 1955 [24] at a LOD 
score of 3 for simple Mendelian traits. In 1995, Lander and 
Kruglyak [25], considering genetic model constraints, suggested 
that the LOD score threshold be raised to achieve the genome-wide 
significance level of 5 %. They proposed that significant linkage 
be reported if the LOD score reaches at least 3.3-3.8 (p value 
0.000045-0.000013, respectively), depending on the pedigree 
structure. Suggestive linkage can be reported if the LOD score is 
at least 1.9-2.4 (p value 0.0017-0.00042, respectively), depending 
on the pedigree structure. 
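
The correspondence between a LOD score and a point-wise p value can be approximated as sketched below. The sketch assumes the usual large-sample result that 2 ln(10) × LOD follows a chi-square distribution with 1 degree of freedom and that the test is one-sided; the function name is illustrative. Because the published thresholds depend on the pedigree structure and analysis method, the values produced here are close to, but not identical with, those quoted above.

```python
from math import erfc, log, sqrt


def lod_to_pointwise_p(lod: float) -> float:
    """Approximate one-sided point-wise p value for a LOD score (1 df chi-square)."""
    if lod < 0:
        raise ValueError("LOD score must be non-negative for this approximation")
    chi_square = 2.0 * log(10.0) * lod
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)); halve it for a one-sided test.
    return 0.5 * erfc(sqrt(chi_square / 2.0))


for lod in (1.9, 2.4, 3.0, 3.3, 3.8):
    print(f"LOD {lod:.1f} -> approximate point-wise p = {lod_to_pointwise_p(lod):.2e}")
```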

Overall, replication studies for complex disease have been very 
disappointing in confirming the initial linkage. Commonly cited 
reasons for this include the possibility that the results of the initial 
findings were false positive or that the disease exhibits genetic 
(locus) heterogeneity [26]. It should be acknowledged that failure 
to replicate does not necessarily disprove a hypothesis unless the 
study is adequately powered. Therefore, replication studies should 
clearly state the power of the study to detect the proposed effect. 
Since replication studies involve testing an established prior 
hypothesis, the issues regarding multiple testing in a genome scan 
can be set aside. As a result, a point-wise p value 
of 0.01 is sufficient to declare confirmation of linkage at the 5 % 
level [25]. 

The number of false-positive results increases with the number 
of tests performed, which poses a problem for genome-wide linkage 
studies. Bonferroni’s correction, which is used extensively in 
epidemiological studies, is too conservative for genetic studies. As 
a result, two stipulations account for multiple testing prior to 
establishing definite linkage: a high level of significance, and a replication 
study using an independent sample cohort. 

The power of a hypothesis test is the probability of correctly 
rejecting the null hypothesis given that the alternate hypothesis is 
true. The following factors may influence the power to detect linkage 
in a complex disease: (1) the strength of genetic contribution; 
(2) the presence of epistasis or locus heterogeneity between genes; 
(3) the recombination fraction between the disease gene and 
marker locus; (4) the heterozygosity of the markers used; (5) the relationships 
of the relatives studied; and (6) the number of families or 
affected relative pairs available [3, 9, 26]. The calculation of sample 
size is not straightforward in linkage studies. Although simulation 
studies have estimated the number of sibling pairs to detect linkage 
for various degrees of genetic contribution to a disease, these are 
based on the assumption that the markers are fully informative and 
tightly linked (i.e., no recombination) to the disease locus [27]. 


4.1.3 Challenges and Limitations of Linkage Studies 

Difficulties encountered in linkage analysis of complex traits 
include: incomplete penetrance (i.e., phenotype is variably 
expressed despite having the genotype); phenocopies (i.e., disease 
phenotype results from causes other than the gene being mapped); 
genetic (locus) heterogeneity (i.e., when two or more loci can 
independently cause disease); epistasis (i.e., interacting genotypes 
at two or more unlinked loci); gene-environment interactions 
(i.e., where environmental factors can also contribute to disease 
susceptibility); and insufficient recruitment of family members [3]. 

The ability to successfully identify a susceptibility region using 
the positional cloning approach, in part, depends on how well one is 
able to overcome these limitations. Phenocopies can be reduced by 
strict adherence to diagnostic criteria and with minimal inclusion of 
patients with atypical features. Incomplete penetrance can be over¬ 
come by using the allele sharing method as all relatives analyzed in 
this method express the phenotype. Sufficient number of families 
can be ascertained through extensive international collaborations 
[3]. Gene-gene interactions (i.e., epistasis) and gene-environment 
interactions are present challenges that have been difficult to address 
in the identification of the susceptibility genes. 

Genetic heterogeneity is a serious obstacle to overcome in linkage 
studies. Due to a wide spectrum of clinical features within a disease, 
at times, evidence for linkage to a locus in one family may be 
offset by evidence against linkage in another family. Even a modest 
degree of heterogeneity may significantly weaken the evidence for 
genetic linkage [9]. A potential solution to limit genetic heteroge¬ 
neity involves transforming the complex disease into a more 
homogenous subset. This can be done by splitting the disorder 
based on the phenotype, studying a single large pedigree, focusing 
on one ethnic group, or by limiting the study to a single geo¬ 
graphic region. Further characterization of the phenotypes usually 
increases the recurrence risk for relatives and reduces the number 
of contributing loci [26]. These measures may also enhance the 
identification of a subset of families that show a Mendelian pattern 
of disease transmission, allowing more powerful model-based 
methods to be used in the linkage analysis. Unfortunately, there is 
no uniformly accepted method for correctly splitting a disease into 
the various genetic forms. 

The successful cloning of the APC gene for colon cancer on 
chromosome 5 was achieved when the phenotype was restricted to 
extreme polyposis [28]. In this case an apparent complex trait was 
narrowed to a simple autosomal one. Early age of onset has also 
been used as a method to limit genetic heterogeneity. For instance, 
in a subset of non-insulin dependent diabetes (NIDDM) patients 
with an earlier age of onset (as reviewed in [3]), segregation analysis 
revealed an autosomal dominant transmission pattern. Subsequent 
studies revealed that a mutation in the glucokinase gene accounts 
for 50 % of cases of NIDDM in the young. Susceptibility genes for 
breast cancer and Alzheimer’s have also been identified by stratify¬ 
ing patients according to the age of onset of the clinical disorder. 
Studying families with multiple affected members or “high genetic 
load” can limit heterogeneity, as was the case with hereditary 
nonpolyposis colon cancer (HNPCC), which was mapped by 
selecting probands with two other affected relatives (as reviewed 
in [3]). Finally, disease severity has been exploited in genetic 
studies, focusing on extreme ends of a trait, whether it be mild 
or severe. This approach works especially well for continuous 
traits [9, 26]. 

Association is not a specifically genetic phenomenon; rather it is 
simply a statistical statement about the co-occurrence of alleles or 
phenotypes. In principle, linkage and association are totally different 
phenomena. Linkage is a relation between loci, but association 
is a relation between alleles or phenotypes. Linkage is a specifically 
genetic relationship, while association is simply a statistical observation 
that might have various causes. Linkage creates associations 
within families, but not among unrelated people. 


4.2.1 Design of Association-Based Studies 

Association-based studies have gained much popularity and 
have become the method of choice for identification of genetic 
variants for complex diseases. Association studies are easier to con¬ 
duct than linkage analysis because no multigenerational families or 
special family structures are required. Moreover, association-based 
methods are more efficient for identifying common variants for 
modest to weak genetic effects, which is the typical effect size for a 
complex trait. Association studies have also immensely benefited 
from the characterization of a large number of SNP markers, link¬ 
age disequilibrium (LD) data from the HapMap project, and the 
development of high throughput genotyping technologies. 
Although association-based genetic studies have rapidly increased 
in number, it is important to appreciate the assumptions, results, 
and challenges using this approach. 

Association studies can either be direct or indirect. For direct 
association studies, the marker itself plays a causative role. The 
topology of the SNP is helpful for selecting variants, as those that 
alter function through non-synonymous protein-coding changes, 
or through effects on transcription or translation are more valuable. 
In an indirect association, the association is due to a marker locus 
being in close proximity to a causative locus, so that there is very 
little recombination between them. The specific gene or region of 
interest is usually selected on a positional basis. Haplotypes are 
important in the indirect approach, because regions within some 
markers are non-randomly associated. Accordingly, SNPs are priori¬ 
tized not only on their potential function but also with respect to 
their position. Tag-SNPs serve as efficient markers as they are within 
blocks displaying strong LD. Haplotype length and allele frequen¬ 
cies from different populations are available from public databases 
such as the International HapMap Project [29]. 

The candidate gene approach focuses on associations between 
genetic variation within prespecified genes of interest and pheno¬ 
types or disease states. Candidate genes are most often selected for 
study based on a priori knowledge of the functional impact of the 
gene on the trait or disease in question. This approach usually uses 
the case-control study design and once investigators have selected 
a candidate gene, they must decide which polymorphism would be 
most useful for testing in an association study. Candidate gene 
association studies are better suited for detecting genes underlying 
common complex diseases where the risk associated with any given 
candidate gene is relatively small [30]. The major difficulty with 
this approach is that in order to choose a potential candidate gene, 
researchers must already have an understanding of the mechanisms 
underlying disease pathophysiology. 

When genotypes are determined at SNPs throughout the 
genome of each individual, the study is called a genome-wide association 
study (GWAS) [31]. The most common GWAS design 
is the case-control setup, which compares two large groups 
of individuals, one healthy control group and one case group 
affected by a disease. Individuals in each group are genotyped for 
the majority of common known SNPs, with the exact number 
depending on the genotyping technology [32]. Each SNP is 
then tested for whether its allele frequency differs significantly 
between the case and the control group [33]. 
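
For a single SNP, the comparison of allele frequencies between cases and controls can be sketched as a Pearson chi-square test on a 2 × 2 table of allele counts. This is a simplified illustration: the function name and the counts are hypothetical, the test assumes independent alleles, and no adjustment is made for population stratification or for the number of SNPs tested.

```python
from math import erfc, sqrt


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 df) comparing allele counts in cases and controls."""
    observed = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_minor + control_minor, case_major + control_major]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(statistic / 2.0))  # chi-square upper tail for 1 df
    return statistic, p_value


# Hypothetical counts: 500 cases and 500 controls (1,000 alleles per group).
assoc_stat, assoc_p = allelic_chi_square(300, 700, 200, 800)
print(f"chi-square = {assoc_stat:.2f}, p value = {assoc_p:.2e}")
```

In a genome-wide study this test is repeated for every SNP, so a far more stringent significance level than 0.05 is needed for any single result.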

A driving assumption for GWAS is that common diseases are 
likely caused by common variants [34, 35]. Because the pheno¬ 
typic effect of any one variant is expected to be small, these alleles 
may reach sufficiently high frequencies to be considered common 
(at least 5 %). The HapMap Project has allowed the development 
of high-throughput approaches to ascertain the genotype for indi¬ 
viduals at 1 million SNPs across the genome, giving a good resolu¬ 
tion for GWAS. Since their debut in 2005, GWAS have identified 
thousands of SNPs associated with hundreds of different complex 
traits and phenotypes [32]. 

Conducting a GWAS for complex traits has been a formidable 
challenge because the contribution of any one locus to the pheno¬ 
type is expected to be small compared with the sizable effects of 
variants causing monogenic disorders. Furthermore, the mapping 
experiments need to cover the entire human genome at a suffi¬ 
ciently high resolution for discovery. Of course, the fact that the 
diseases are common means that large cohorts of individuals can be 
recruited for case-control studies, with thousands of affected and 
non-affected persons enrolled in a study, thus providing substantial 
power. 

Regardless of whether the study is a candidate gene or genome- 
wide association approach, key elements of any association-based 
study are selection of the disease trait, identification of the cases 
and controls, selection of markers and genotyping platform, and 
finally the genetic analysis and interpretation. 

As in linkage studies, the phenotype is critical for association-based 
genetic studies. Before initiating a genetic association study, it is 
important to verify the genetic burden of the disease. Similar to link¬ 
age studies, most of the traits for association studies are dichotomous, 
from a clinical perspective, as this best reflects a disease versus a 
healthy state. However, there are advantages to studying quantitative 
traits, especially ones that demonstrate high heritability. In general, 
quantitative traits are measured more accurately and retain substan¬ 
tially more power than qualitative traits. An important consideration 
for quantitative traits is the distribution of the trait, as most gene 
mapping methods assume a normal distribution. 

Endophenotypes, which are broadly defined as the separable 
features of a disease, are very helpful for gene mapping initiatives. 
An endophenotype-based approach has the potential to enhance 
the genetic dissection of complex diseases because endophenotypes 
may be a result of fewer genes than the disease entity, and may be 
more prevalent and more penetrant [36]. It has been suggested that the 
endophenotype should be heritable, be primarily state-independent 
(i.e., manifest in an individual whether or not illness is active), co-segregate 
with illness within families, and be found in non-affected 
family members at a higher rate than in the general population 
[36]. Unfortunately, the identification of such phenotypes has often 
been elusive for many complex diseases. 

The most cost-efficient method for association-based studies is the 
traditional case-control design, where unrelated probands are 
compared with unaffected, unrelated controls [37]. Most of the 
cases for association-based studies are ascertained from disease reg¬ 
istries or are retrieved from hospital or clinic visits. The unrelated 
controls are usually healthy individuals from the community or 
hospital-based controls. As population stratification can hamper the 
results of a genetic association study, attempts are made to match 
the ethnicity of cases and controls. Appropriately ascertained con¬ 
trols are essential to case-control studies and this is accomplished 
by matching the ancestry of cases and controls (up to the level of 
the grandparents if possible). While some investigators feel that this 
is sufficient, significant stratification can still occur, especially in 
populations with historical admixture such as African Americans. 
As summarized in an editorial from Ehm et al. [38], understanding 
of the criteria used to select the case and control samples should be 
clearly articulated. It is important to indicate the geographic loca¬ 
tion, population- or clinic-based selection, and source of controls. 
Other designs for association-based methodologies include family- 
based case-control (e.g., parent offspring trios), multiplex case- 
control (i.e., multiple cases from one pedigree), or prospective 
cohort designs. 

Of the family association-based methods, the trio design 
(i.e., two parents and an offspring) is the most popular. Importantly, 
there is no issue of population stratification using this method and 
the analysis of trios is conducted via a transmission disequilibrium 
test, where the non-transmitted parental alleles are used as control 
alleles [9, 26]. The use of trios is not without limitations; there 
may be difficulties in recruiting parents, particularly for late-onset 
diseases, and there is some inefficiency compared with the case- 
control design due to the genotyping of more individuals. 
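
The transmission disequilibrium test can be sketched in its basic form for a diallelic marker. The sketch counts only heterozygous parents, since they are the informative ones: b is the number of times the allele of interest is transmitted to the affected offspring and c the number of times it is not. The statistic (b - c)²/(b + c) is referred to a chi-square distribution with 1 degree of freedom. The function name and counts are hypothetical.

```python
from math import erfc, sqrt


def transmission_disequilibrium_test(transmitted: int, untransmitted: int):
    """Basic TDT: (b - c)^2 / (b + c) with a 1 df chi-square reference distribution."""
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no heterozygous (informative) parents")
    statistic = (transmitted - untransmitted) ** 2 / informative
    p_value = erfc(sqrt(statistic / 2.0))  # chi-square upper tail for 1 df
    return statistic, p_value


# Hypothetical data: 100 heterozygous parents, allele transmitted 60 times.
tdt_stat, tdt_p = transmission_disequilibrium_test(transmitted=60, untransmitted=40)
print(f"TDT chi-square = {tdt_stat:.2f}, p value = {tdt_p:.3f}")
```

Because transmitted and non-transmitted alleles come from the same parents, the comparison is matched on ancestry, which is why stratification is not an issue with this design.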

Single nucleotide polymorphisms (SNPs) are the marker of choice 
for association-based studies because they are sufficiently numerous 
to define LD blocks (unlike microsatellites) and are less 
mutable than microsatellites. SNP markers are selected using various 
criteria including: the potential function of the SNP, extent 
of LD of the SNP (i.e., tag-SNPs), and technological considerations 
(i.e., ability to be multiplexed in a single reaction) [2]. 
Huizinga et al. recommended that authors specify the quality 
measures used for genotyping analysis, in order to minimize genotyping 
errors and allow the results to be more easily replicated with 
similar genotyping protocols [39]. The quality measures include: 
internal validation, blinding of laboratory personnel of the affected 
status, procedures for establishing duplicates, quality control from 
blind duplicates, and blind or automated data entry. Assurance 
regarding the satisfaction of the Hardy-Weinberg equilibrium is 
also important. Hardy-Weinberg or genetic equilibrium is a basic 
principle of population genetics, where there is a specific relation¬ 
ship between the frequencies of alleles and the genotype of a pop¬ 
ulation. This equilibrium remains constant as long as there is 
random mating and absence of selection pressures associated with 
the alleles. 
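
As a minimal sketch of how Hardy-Weinberg equilibrium might be checked for one biallelic SNP, the example below compares observed genotype counts with the proportions p², 2pq, and q² expected from the allele frequencies, using a chi-square goodness-of-fit test with one degree of freedom. The function name and the genotype counts are hypothetical; in practice an exact test is often preferred when counts are small.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one SNP.

    n_aa, n_ab, n_bb: observed genotype counts (illustrative names).
    Returns (frequency of allele A, chi-square, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele B
    if p in (0.0, 1.0):
        raise ValueError("monomorphic marker: test is undefined")
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a 1-df chi-square: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value

# Hypothetical control sample with a deficit of heterozygotes
freq_a, chi2, p_value = hwe_chi_square(40, 20, 40)
print(f"p(A) = {freq_a:.2f}, chi-square = {chi2:.1f}, p = {p_value:.2e}")
```

A marked departure in controls, as in this made-up example, is usually treated as a warning sign of genotyping error or stratification rather than as a biological finding.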

Due to the rapid advances in marker selection and genotyping
technologies, SNP genotyping for association-based studies has
improved remarkably over the last decade. These advances
have allowed the genetic variants responsible for the genetic
component of a phenotype to be sought directly via GWAS, which have
revolutionized the identification of genomic regions associated with
complex diseases. Approximately 2,000 robust associations
have been identified in more than 300 complex diseases and traits [40].
These numbers are orders of magnitude greater than those of replicable
linkage and candidate gene association findings to date for
complex diseases.

Caution must be used when interpreting the results of a genetic 
association study. This notion is best exemplified by the fact that so 
few genetic association studies are replicated. In a review of 166 
putative associations that had been studied three times or more, 
only six were reproduced at least 75 % of the time [41]. The major
reasons for the lack of replication include false-positive associations,
false-negative associations, and a true association that is
population-specific (i.e., relatively rare). The most robust results
from an association study will have a significance that is maintained
after careful scrutiny for population stratification and multiple testing
issues. The results should be replicated independently in
a separate population. A replicated association may be the
result of a SNP being the causative mutation, or being in LD with
the true disease allele. Finally, the results should be supported by
functional data, if the variant is predicted to be causative.

The greatest problem with association studies is the high rate
of false-positive associations. The use of a p-value below 0.05 as a
criterion for declaring success in association-based studies is not
appropriate. Multiple testing must be taken into account due to
the large number of markers being analyzed, but the traditional
Bonferroni correction that is often used in epidemiological studies
is far too harsh for genetic epidemiology [34]. This is because the
selected markers are often in LD (thus not independent) and some
of the characteristics of the disease entity are closely correlated.
Permutation tests are increasingly used to correct p-values for
multiple testing. This process generates a distribution of the best
p-value expected in the entire experiment under the null hypothesis
of no association between genotype and phenotype. Bayesian
methods that take into account pretest estimates of likelihood
have also been proposed, but this approach is often difficult to
interpret.
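
The permutation idea can be sketched as follows. This is a simplified illustration rather than the procedure of any cited study: case and control labels are shuffled, the strongest signal across all markers is recorded for each shuffle, and the observed best signal is compared with that null distribution. For brevity the sketch uses the largest absolute difference in mean allele count as a stand-in for the best p-value, which is an assumption; the function names and toy data are hypothetical.

```python
import random

def max_abs_mean_diff(genotypes, labels):
    """Largest case-control difference in mean allele count over markers.

    genotypes: one list per individual of allele counts (0, 1, or 2).
    labels: 1 for case, 0 for control (both groups must be non-empty).
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    best = 0.0
    for m in range(len(genotypes[0])):
        case_sum = sum(g[m] for g, y in zip(genotypes, labels) if y == 1)
        ctrl_sum = sum(g[m] for g, y in zip(genotypes, labels) if y == 0)
        best = max(best, abs(case_sum / n_cases - ctrl_sum / n_controls))
    return best

def permutation_adjusted_p(genotypes, labels, n_perm=1000, seed=0):
    """Experiment-wide p-value for the best marker, by label shuffling."""
    rng = random.Random(seed)
    observed = max_abs_mean_diff(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if max_abs_mean_diff(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)

# Toy data: marker 0 separates cases from controls, markers 1-2 do not
cases = [[2, i % 2, 1] for i in range(20)]
controls = [[0, i % 2, 1] for i in range(20)]
genotypes = cases + controls
labels = [1] * 20 + [0] * 20
adjusted_p = permutation_adjusted_p(genotypes, labels, n_perm=199, seed=1)
print(f"permutation-adjusted p = {adjusted_p:.3f}")
```

Because the markers are shuffled together, their correlation (LD) is preserved in every permutation, which is the property that makes this approach less conservative than a Bonferroni correction.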

Population stratification can result in false positives. Controlling 
for ancestry as noted above is one method to minimize this effect. 
Methods have also been proposed to detect and control population
stratification by genotyping dozens of random unlinked markers
(as reviewed in [34]). Ideally, these markers should be able to
distinguish between the subpopulations. Most studies that have
attempted to use this genomic control have themselves been
underpowered and the ability to detect modest stratification has
been limited. The impact on population stratification of matching
for ethnicity based solely on family history is presently being
debated, and needs to be better elucidated. A false-positive result
can also occur if the cases and controls are genotyped separately,
especially if the genotyping of the cases and controls is done at
different centers.

False-negative results can arise from an inadequate sample 
size to detect a true difference. When reporting a negative replication
study, attempts must be made to assess the power of the
study to exclude a genetic effect of a certain magnitude. 
Occasionally, true heterogeneity will exist within populations; 
however, this is relatively rare. 
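
One rough way to assess the power of a negative replication study is a normal approximation for comparing allele frequencies between cases and controls. The sketch below is an illustration under stated assumptions, not a substitute for dedicated power software: it treats each individual as contributing two independent alleles, uses a two-sided significance level, and ignores the negligible chance of rejecting in the wrong direction. The function name, allele frequencies, and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def allelic_power(p_case, p_control, n_cases, n_controls, alpha=0.05):
    """Approximate power to detect an allele frequency difference.

    p_case, p_control: assumed allele frequencies in each group.
    n_cases, n_controls: numbers of individuals (2 alleles each).
    """
    nd = NormalDist()
    se = math.sqrt(
        p_case * (1 - p_case) / (2 * n_cases)
        + p_control * (1 - p_control) / (2 * n_controls)
    )
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(abs(p_case - p_control) / se - z_crit)

# Hypothetical scenario: risk allele at 25 % in cases, 20 % in controls
small_study = allelic_power(0.25, 0.20, 200, 200)
large_study = allelic_power(0.25, 0.20, 2000, 2000)
print(f"power with 200 per group:  {small_study:.2f}")
print(f"power with 2000 per group: {large_study:.2f}")
```

A study with low power for the effect size originally reported cannot be taken as evidence against the association.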

Due to the difficulties in interpreting genetic association studies,
Freimer and Sabatti [42] published suggested guidelines for
future submissions to Human Molecular Genetics. For candidate
gene analysis, they suggest that a distinction be made between a
candidate with previous statistical evidence and those proposed
solely on biological hypotheses. Investigators should also specify
the p-value, the phenotypes, the genomic region, and a quantitative
estimate of the prior probability. There must be a rational
attempt to account for multiple testing and, in a conservative, least
favorable scenario, it has been proposed that the p-value be less
than 10⁻⁷ to declare an association. For higher p-values, the
authors should justify why the association is noteworthy.

4.2.3 Challenges of Association-Based Studies


A surprising finding which was evident from some of the earliest 
GWAS investigations for complex diseases was small associated 
odds ratios. Moreover, the total fraction of the phenotypic variation 
explained for most phenotypes remains small (often 10 % or less) 
relative to the published heritability estimates, which are estimated 
using the trait covariance among relatives (i.e., familial clustering) 
[43-45].

Of particular interest is the distribution of causal variants along
the genome, their number, and their frequency spectrum. GWAS
are particularly suited to capturing common variants (i.e., minor
allele frequency >5 %) and depend on the common disease common
variant model. The “missing heritability” is the proportion of
the estimated heritability of a complex disease that is not presently
accounted for by known genetic variants [46, 47]. For many complex
traits and diseases, it appears that one half to one third of the
genetic variance is not tagged by current and past SNP chips [44, 
48]. These findings suggest that many lower-frequency variants are 
also needed to explain the genetic variance that is not tagged by SNP 
chips. In Fisher’s infinitesimal model there are expected to be a large 
number of rare variants (i.e., <1 % MAF) associated with disease. 
The rare-allele model proposes that rare variants of large effect 
account for a significant fraction of phenotypic variation [45, 49]. 
The combined contribution of multiple rare loci to the population- 
level genetic variance remains an open question because association 
studies that focus on rare variants remain underpowered. 

Many explanations for the sources of “missing heritability” have 
been proposed, including imperfect SNP-tagging (producing weak
GWAS signals), structural variations, gene-environment interactions,
epigenetics (e.g., methylation analysis), epistatic interactions,
parent-of-origin effects, phenotype misclassification, exclusion of
the mitochondrial genome, and errors in narrow-sense heritability
estimates [43, 44, 50-52]. This “missing heritability,” which is also
reflected in the generally small odds ratios and limited predictive
value [53, 54] of these variants, has raised questions about the
ultimate applicability of these findings to risk prediction, in particular
for those variants that will be clinically actionable [55, 56].

4.3 Future Direction

In order to elucidate additional genetic variation in complex disease
and account for the “missing heritability,” researchers will need to
improve on genome-wide study designs, phenotyping, and data 
analysis, as well as combining complementary data collected from 
multiple investigations. 

Deeper sequencing-based characterization of genomic variation, 
fine mapping, imputation, and denser SNP arrays are extending 
the reach of GWAS to ever lower ranges of minor allele frequency 
[57-60]. As genomic technologies improve, detection of associations
with variants of lower frequency is increasingly becoming
possible. As large-scale parallel-sequencing studies of many thousands
of individuals become commonplace, sufficient power is
likely to be gained, allowing both rare and common variants to be
dissected to a greater extent.

Improving on current study designs (i.e., sample size, extreme 
phenotypes, better phenotyping, and inclusion of family-based 
studies) may help shed light on fundamental genes, pathways, and 
cell types involved in disease, and help explain some of the so-called 
“missing heritability” associated with complex disease.

As sample sizes increase, so too do the number of identified
genomic regions and the amount of variation explained by association
studies. Increasing sample size will have the greatest effect on
power. Replacing high-density SNP chips with full sequencing will
tag low-frequency loci, but it will not be enough alone to capture
the effects of rare variants, because many rare variants will be present
in such low numbers that large data sets are required for their detection.
Another general strategy for increasing power is to focus on samples 
with extreme phenotypes and whose relatives have similarly extreme 
phenotypes [61-63]. 

Misclassifying a phenotype, especially when multiple distinct 
phenotypes are influenced by different sets of underlying causal 
variants, can reduce power in a GWAS investigation. Inaccurate
phenotyping may combine phenotypes or diseases that
have partially or even completely distinct underlying causal variants.
This will average effect sizes across groups of individuals who could
be better separated on the basis of better phenotyping or a combination
of information from different sources.

Overlapping GWAS results with other genomic sources of 
information is likely to explain additional variation and identify 
novel pathways. The targeting of expression SNPs and the linking 
of GWAS, gene expression, and methylation data have uncovered
additional variants and provided direct information on the underlying
biology of complex phenotypes [64]. Finally, one cannot
assess a single gene in isolation. Careful consideration should be
given to developing a genetic risk score involving multiple loci as
demonstrated by Chen et al. [65]. It will be necessary to integrate
relevant clinical information and environmental risk factors. The
inclusion of disease-specific, environmental, and genetic information
will likely enhance the predictive capacity of any model. Ideally,
these predictive algorithms require prospective evaluation in randomized
controlled trials to assess the utility of including genetic
information [66]. 
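
The text does not describe how the risk score of Chen et al. was built, so the following is only a generic sketch of one widely used construction: a weighted genetic risk score in which each locus contributes its risk-allele count multiplied by the natural log of its reported odds ratio. The function name, allele counts, and odds ratios are hypothetical.

```python
import math

def weighted_genetic_risk_score(risk_allele_counts, odds_ratios):
    """Sum over loci of (risk-allele count) x ln(odds ratio).

    risk_allele_counts: 0, 1, or 2 risk alleles at each locus.
    odds_ratios: per-allele odds ratio assumed for each locus.
    """
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per locus")
    return sum(
        count * math.log(odds_ratio)
        for count, odds_ratio in zip(risk_allele_counts, odds_ratios)
    )

# Hypothetical individual genotyped at three loci
score = weighted_genetic_risk_score([2, 1, 0], [1.2, 1.1, 1.3])
print(f"weighted genetic risk score = {score:.3f}")
```

Such a score would normally enter a prediction model alongside the clinical and environmental covariates mentioned above rather than being used on its own.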


5 Conclusion 


The identification of genes underlying a complex trait remains a difficult
task that requires substantial resources, including a large collection
of samples (including families), highly informative markers, technical
expertise, and high-throughput genotyping, as well as sophisticated
statistical approaches. However, due to the wealth of data
which emanated from the Human Genome Project and HapMap 
project, as well as the experience attained from the numerous linkage 
and association studies, gene variant identification is now possible 
for complex genetic disease. 

The two most common approaches are genetic linkage and 
association studies, with the latter having greater power than
linkage studies to detect genotype relative risk of modest effect.
With advances in genotyping technology and well-validated SNPs
extensively covering the genome, there is great promise for this
method for identification of genetic determinants for complex disease.
However, because this approach requires many more samples
and markers, its interpretation should be viewed with caution until
it is independently replicated.

GWAS have identified many thousands of significant associations 
across several hundred human phenotypes, and it is clear that, for 
any given trait, genetic variance is likely contributed from a large 
number of loci across the entire allele frequency spectrum. The 
proposed framework for future studies briefly outlined above will 
likely result in the identification of rare variants, which will help 
elucidate the genetic variation of a range of complex traits and 
hopefully solve the mystery of “missing heritability.” Moreover, 
these steps will improve our ability to predict disease risk and identify
new drug targets.

Clinical Genetic Research 3: Genetics ELSI
(Ethical, Legal, and Social Issues) Research

Daryl Pullman and Holly Etchegary 

Abstract 

ELSI (Ethical, Legal, and Social Issues) is a widely used acronym in the bioethics literature that encompasses 
a broad range of research areas involved in examining the various impacts of science and technology on 
society. In Canada, GE³LS (Genetics, Ethical, Economic, Environmental, Legal, Social issues) is the term
used to describe ELSI studies. It is intentionally more expansive in that GE³LS explicitly brings economic
and environmental issues under its purview. ELSI/GE³LS research has become increasingly important in
recent years as there has been a greater emphasis on “translational research” that moves genomics from the 
bench to the clinic. The purpose of this chapter is to outline a range of ELSI-related work that might be 
conducted as part of a large scale genetics or genomics research project, and to provide some practical 
insights on how a scientific research team might incorporate a strong and effective ELSI program within 
its broader research mandate. We begin by describing the historical context of ELSI research and the 
development of GE³LS research in the Canadian context. We then illustrate how some ELSI research
might unfold by outlining a variety of research questions and the various methodologies that might be 
employed in addressing them in an area of ELSI research that is encompassed under the term “public 
engagement.” We conclude with some practical pointers about how to build an effective ELSI/GE³LS
team and focus within a broader scientific research program. 

Key words Ethical, Legal, Social issues (ELSI), GE³LS, Clinical genetics, Mixed methods


1 Introduction: What Is ELSI Research? 

ELSI (Ethical, Legal, and Social Issues) is a widely used acronym 
in the bioethics literature that encompasses a broad range of 
research areas involved in examining the various impacts of science
and technology on society more generally. GE³LS (Genetics,
Ethical, Economic, Environmental, Legal, Social issues) is a
made-in-Canada variation on ELSI studies; it is intentionally more
expansive in that GE³LS explicitly brings economic and environmental
issues under its purview. In this chapter we use the ELSI
acronym generally as it is more commonly found in the wider
scientific literature. However, we occasionally refer to GE³LS
when discussing how some of these activities have developed in
Canada in particular.

In the past decade, there has been an increasing interest in 
“translational research” that moves the outputs of bench science to 
the bedside or the marketplace in an effective, economical and ethical
manner [1, 2]. This translational emphasis sometimes means
that scientific teams must focus more of their attention and resources
on the product delivery end of the research pipeline, and comparatively
less on the discovery end. While ELSI arise at each phase of
the research pipeline process, they are especially acute in the translational
context when broader issues of science policy, technology
assessment, user preparedness to adopt new technologies, and consumer
willingness to utilize them, all become more immediate.
ELSI research is often instrumental in the translational aspect of
scientific research and many research teams are actively engaging
ELSI experts to assist with this translational mandate [3]. Working
closely with an appropriate team of ELSI researchers can assist
translational genomics teams to identify and address potential barriers
to the uptake of the products of their work. The purpose of
this chapter is to outline a range of ELSI-related work that might be 
conducted as part of a large scale genetics or genomics research 
project, and to provide some practical insights on how a scientific 
research team might incorporate a strong and effective ELSI 
program within its broader research mandate. 

At its heart, ELSI research in genomics is interdisciplinary; it
brings together researchers from a wide range of academic disciplines
spanning the life, clinical, and social sciences. It may also
include a range of other stakeholder or user groups (e.g., policy-makers,
commercial partners, special interest groups, or research
populations). ELSI researchers include philosophers, bioethicists,
legal scholars, policy experts, economists, geographers, communications
experts, humanities scholars, and social scientists from
virtually every social science discipline (anthropology, archaeology, 
psychology, sociology, to name only a few). In short, ELSI research 
is as diverse as the range of disciplines represented and it draws 
upon the research methodologies that each of these disciplines 
brings to the issues under investigation. 

In this chapter, we begin with a brief historical overview of
the advent of ELSI research in general and GE³LS research in particular
as the latter has evolved over the past decade in the Canadian
context. We then discuss the distinction between
descriptive and normative research, which is important to understand
when determining the kinds of ELSI questions that need investigation
in relation to a particular scientific or clinical research project,
or in assessing the outputs of ELSI research. We then illustrate
how some of this work might unfold by outlining a variety of
research questions and the various methodologies that might be
employed in an area of ELSI research encompassed under the term
“public engagement.” We conclude with some practical pointers
about how to build an effective ELSI/GE³LS team and focus
within a broader scientific research program.


2 Advent of ELSI/GE³LS Research

The modern era of genetics and genomics research can be traced to 
the inception of the Human Genome Project which was proposed in 
the 1980s and formally initiated in 1990. From the outset there was 
an emerging concern about what the implications of the genomics 
era might mean for humankind more generally. Very early on, the
Human Genome Organization (HUGO), an international oversight body
that was instituted at the first meeting on genome mapping and
sequencing at Cold Spring Harbor in April 1988 [4], established
an ELSI committee to identify and attend to broader ethical, legal
and social issues that were anticipated to arise. That committee
issued a “Statement on the Principled Conduct of Genetics Research”
in 1995 [5] and the committee continues to provide ELSI oversight
as new issues arise. While HUGO’s mandate was international in
scope, a number of national bodies also evolved in the succeeding
years. In the USA the National Human Genome Research Institute
(NHGRI) was established in 1989 to support and facilitate genomics
research. The NHGRI has embraced an ELSI mandate and
continues to support research in this area [2, 3].

In Canada, the major impetus for genomics research came with 
the creation of Genome Canada, a not-for-profit government 
agency established in 2000 to leverage research in genomics and 
proteomics for the benefit of all Canadians. Genome Canada’s 
mandate includes not only human genetics research, but also 
genetic and genomics research in forestry, agriculture, aquaculture
and related biosciences. Hence the ELSI mandate expanded accordingly
to include economic and environmental concerns. The GE³LS
acronym captures this broader mandate more explicitly.

While Genome Canada was not unique in recognizing the
relevance of ELSI/GE³LS research to the genomics enterprise,
it was unique in that it made the inclusion of a GE³LS research
component a mandatory requirement for all of the large-scale science
projects it funded. At least three different models of GE³LS
research evolved over the early years of Genome Canada’s mandate,
which came to be known respectively as “stand-alone,” “embedded,”
and “integrated” GE³LS research. “Stand-alone projects,” as
the name suggests, were GE³LS projects that were funded directly
by Genome Canada through one of its regional genomics research
centers and which operated independently of any science-related
project. Such stand-alone projects are similar to what other
genomic research organizations might fund under their ELSI programs.
“Embedded projects” are GE³LS research projects that are funded
within a larger genomics science project but in which there is no
necessary connection between the science and the GE³LS research
being conducted. For all intents and purposes stand-alone and
embedded projects are essentially the same; it is only the manner in
which they are funded that differs (i.e., stand-alone projects are
funded directly by Genome Canada while embedded projects are
funded indirectly through a science platform). The most interesting,
challenging, but also often the most rewarding GE³LS research
projects are those that are integrated into the science project from
which they draw their funding. In actuality a properly integrated GE³LS
project is not separate from a science project but is part and parcel
of the project from its inception. What this means is that from the
beginning of the project, when researchers are deciding upon the
research questions they want to explore and the manner in which
they will construct their research proposal for funding purposes,
appropriate GE³LS researchers are involved in drafting the research
proposal to identify specific GE³LS issues that arise out of the proposed
scientific or clinical research, and to develop appropriate
methods for addressing them. Thus, it is not uncommon that one
of the co-principal investigators on a large-scale genomics project
would be a GE³LS expert responsible for working closely with the
lead scientists and clinicians and for overseeing the potentially
broad-ranging GE³LS mandate.

While integrated GE³LS research can be the most rewarding,
it is also often the most challenging as considerable energies must
often be invested to overcome disciplinary barriers. Thus, Collins
et al. [3] observe: “New mechanisms for promoting dialogue and
collaboration between the ELSI researchers and genomic and clinical
researchers need to be developed; such examples might include
structural rewards for interdisciplinary research, intensive summer
courses or mini-fellowships for cross-training, and the creation of
centers of excellence in ELSI studies to allow sustained interdisciplinary
collaboration.” A first step in promoting dialogue and collaboration
between ELSI and genomic and clinical researchers is
for those from disparate research orientations to understand something
of the other’s perspective. In point of fact, as already noted,
there is no particular ELSI perspective as ELSI encompasses a wide 
and diverse range of research methodologies. Nevertheless, there 
are some basic distinctions that help to navigate the ELSI domain 
and which can inform decisions about the kinds of ELSI questions 
that might be explored. The first of these is the distinction between 
descriptive and normative research. 

2.1 “Descriptive” Versus “Normative” Research


Descriptive and normative research both figure prominently in 
ELSI work. While the outputs of each type of investigation are 
often complementary, the general methodologies can differ considerably.
“Descriptive” research, as the name implies, sets out to
examine the world as we encounter it and to understand the way
things actually are. This type of research is most familiar to basic
scientists and clinical researchers as they set out to understand various
physical phenomena such as gene functions and clinical manifestations
of a disease, for example. ELSI researchers who focus
primarily on descriptive work do much the same thing, although
the phenomena they describe are generally social. If we are interested
in knowing whether the public is concerned about the privacy
of their genetic information, for example, or if there is general
anxiety about whether life insurers might require them to have a 
genetic test before being considered for insurance purposes, social 
scientists might develop a survey or conduct focus groups with the 
goal of understanding what people actually feel or believe about 
genetic privacy or genetic discrimination issues (see our discussion 
of “public engagement” outlined later in this chapter). This kind 
of work is descriptive; it aims simply to ascertain what people are 
thinking or feeling without making any judgment about whether 
such views or beliefs are good or bad, right or wrong. Researchers 
have found, for example, that people who suffer from Huntington’s 
disease experience discrimination when seeking life insurance [6]. 
While this may be descriptively true, we still don’t know what follows
from it normatively. That is, should life insurance companies
be prohibited from asking questions about family history? Should 
they be required to insure people with terminal illnesses? These 
latter questions are “normative,” and no amount of empirical work 
will provide definitive answers on such issues. 

Normative or prescriptive work aims to provide arguments in 
support of some preferred view of how things ought to be. Put 
otherwise, the normative challenge is to derive an “ought” from an 
“is.” Such normative work is generally more conceptual than 
empirical in nature, and at times a normative conclusion about
how things ought to be might go contrary to a descriptive observation
about what is actually the case. That being said, normative
conclusions are generally responsive to empirical realities. In 2008,
for example, the USA introduced federal legislation (the “Genetic
Information Nondiscrimination Act,” better known as “GINA”)
to prohibit genetic discrimination with regard to employment and health
insurance [7]. Insofar as the vast majority of Americans receive health
insurance through their employers (descriptively true), the US Congress
decided it was necessary to have a law that prohibited employers
from using genetic information when screening potential employees
(a normative process). Canada does not have any legislation
similar to GINA, although there have been some who have lobbied
for it. The reality is, however, that because Canada has a program of
universal health care, Canadians do not rely upon their employers
for health insurance. This empirical difference results in a different 
normative conclusion for Canada [8]. 

Generally ELSI research spans both the descriptive and normative
domains. It includes social scientists with expertise in
exploring the social world which will be impacted by the outputs
of genetic research, as well as normative researchers like ethicists,
health policy experts, and legal scholars who are charged with
addressing the most appropriate ways for translating that research
to the greater social good. The methodologies involved vary considerably
but all are essential to the translational task. ELSI
researchers have generally developed good working relationships
across the various disciplines involved in addressing these descriptive
and normative tasks. The effectiveness of their work increases
markedly when they can engage genomics and clinical researchers
directly in the ELSI process as well.


3 Public Engagement with Genomics: An Instructive Case Study of ELSI 
Research Topics and Methods 


A significant portion of ELSI genetics research over the last two
decades has focused on public attitudes towards, and engagement
with, new genomic developments. This research is instructive in
both the number of ELSI topics and content areas highlighted, as
well as the variety of research methods employed. In this section
we describe a number of empirical research studies that highlight
significant ELSI concerns in clinical genetics research, as well as the
breadth of mixed methods employed in ELSI research designs.

3.1 Why Is Public Engagement a Significant ELSI Focus in Genetics Research?

While promoting public participation in policy decisions is not new, 
there is a growing emphasis in both academic and policy circles 
on the importance and necessity of public involvement [9, 10], 
particularly in health contexts [11], and most recently, in the area 
of genetics and personalized medicine [12-14]. Community
engagement is endorsed by many federal agencies as a way to
“build trust, enlist new resources and allies, create better communication
and improve health outcomes” [9]. At the same time,
research efforts to improve population and individual health 
increasingly rely on large-scale collections of individuals’ genetic 
information, linked with other health, lifestyle, and administrative 
data. Such collections or “biobanks,” are often used in prospective, 
longitudinal cohort studies and have become standard research 
tools to investigate the interactive effects of genes, environment, 
and lifestyle on health and disease [12, 13, 15]. New genomic 
sequencing technologies such as whole-genome sequencing (which
measures variation across an individual’s entire genome) and
whole-exome sequencing (which measures variation only in the portion of
DNA that encodes proteins) are increasingly guiding clinical
practice for a number of disorders [16-18]. While such developments
offer the potential for genomic information to improve health
outcomes, they are also associated with a number of significant
ELSI (e.g., clinical validity and utility of the information
generated, data management and sharing, and informed consent
models for such research). These new developments in genomics
continue to increase the number of disorders for which genetic
testing is available, whether through the primary health-care system,
direct to consumer (DTC) testing via the internet [19, 20], or as
part of expanded newborn screening panels [21]. Not surprisingly,
there is a growing interest on the part of policy-makers and scholars
alike (i.e., those responsible for the normative outputs of ELSI
studies) in public attitudes towards these continued developments
and the ELSI they raise.


3.2 An Overview of Public Engagement Approaches in Genetics ELSI Research


A variety of methods have been employed to engage communities in 
genomics consultation initiatives including surveys, focus groups, 
town hall meetings, citizens’ juries, and Web or community forums, 
to name just a few. Engagement can have a number of different 
levels and has been described as existing on a continuum or ladder, 
ranging from simply providing information, to more substantial 
community consultation, through to communities having an equal 
share in the decision-making power [11, 22]. Early ELSI research 
was largely at the public information level and aimed to explore pub¬ 
lic interest and uptake in biobank research and their attitudes towards 
the associated ELSI (such as informed consent processes and the 
return of individual research results). These studies largely employed 
national, random surveys of the general public. 

ELSI studies at times employ large random surveys of the gen¬ 
eral public, even at the national level. For example, there was wide¬ 
spread support in the USA (84 %) for the creation of a large genetic 
cohort study with 60 % of Americans indicating they would become 
donors [23]. Interest in the study and willingness to participate did 
not significantly vary among demographic groups. Notably, how¬ 
ever, features of the research such as study burden and whether 
individual results would be returned to participants did affect willingness 
to participate. Similarly, a majority (83 %) of patients in a 
large Veterans Affairs’ patient database indicated their support for 
the creation of a large genomic research study, and 71 % indicated 
they would likely participate [24]. Similar support for genomics and 
biobanking research has been observed in national surveys across 
Canada [13, 25], other US locales [15], Sweden [26], Scotland 
[27], as well as in large international efforts (e.g., the International 
HapMap Project) [28]. 

Other ELSI research efforts have moved beyond merely solic¬ 
iting information from communities and assessing their research 
participation interest to provide more substantive consultation 
opportunities. These ELSI research studies have used a variety of 
methods, including town hall meetings, the creation of community 
advisory boards, and engagement forums, to name a few. In an 
instructive report, Lemke and colleagues [29] described commu¬ 
nity engagement activities undertaken by six biobanks in the USA, 
highlighting a range of public engagement mechanisms. Biobank 
governance was informed by community surveys and focus groups 
with patient and specialist groups, but also by consensus develop¬ 
ment panels, community advisory panels, and deliberative democ¬ 
racy events. Case studies such as these are especially informative; 
they reveal the pragmatic and policy outcomes of ELSI engage¬ 
ment efforts (e.g., influencing the choice of biobank consent 
processes or the policy on the return of individual research results) 
showing again how the descriptive informs the normative. These 
research efforts also reveal the challenges of community engage¬ 
ment (e.g., engaging the public on a topic about which many claim 
they are uninformed, or recruiting a diverse range of stakeholders 
for advisory groups and panels). 

ELSI research in public engagement with genetics has highlighted 
genetic health literacy as a key area of concern. At least 
some knowledge (and perhaps prior thought) about many of the 
ELSI in genomics and personalized medicine (e.g., clinical utility 
of tests, policies on data sharing, biobank governance) is necessary 
if the public is to participate in policy discussions in a meaningful 
way. In order to raise awareness about personalized medicine, 
the NHGRI hosts a series of ELSI community engagement pro¬ 
grams including the Family History Demonstration projects and 
the Community Genetics Forum that have been well received [30]. 
These efforts represent novel ELSI research approaches and are 
designed to facilitate community dialogue about the connections 
between genetics and health. Pragmatically, they also provide edu¬ 
cational curricula and materials for community groups and others 
wishing to engage with issues around genetics and personalized 
medicine. Ongoing ELSI research efforts in the authors’ local 
jurisdiction are also intended to raise public awareness about genet¬ 
ics and health. In Newfoundland, for example, we have conducted 
public surveys, as well as delivered community education and con¬ 
sultation sessions about newborn screening, genomics research and 
specific issues related to biobanking (e.g., consent models) [31-33]. 
These projects employed a number of ELSI methods including 
both qualitative (focus groups, open survey items) and quantitative 
(e.g., conjoint analysis) approaches. In accordance with prior 
research [34, 35], all these public engagement efforts revealed a 
largely positive attitude towards the potential for genomic medicine 
to improve health, but also areas of public concern such as the 
privacy of genetic information, storage of and access to the information, 
as well as questions about the clinical utility of genomic information 
for health. Notably, these are the very ELSI with which 
policy makers and health-care systems will continue to grapple as 
genomic medicine is integrated into current health-care systems. 

At the far end of the continuum of public engagement methods 
are those that are far more ambitious than the projects described thus 
far. The deliberative democracy approach employed by the British 
Columbia biobank in Canada has been well described [13, 36]. 
In this stand-alone ELSI project, a diverse group of citizens com¬ 
mitted to a two-weekend deliberation on the values that should 
guide biobanking in their province. Participants were charged with 
reaching consensus (if possible) on a range of important ELSI such 
as consent procedures, biobank governance, and access to biobank 
data. Consensus was reached in several areas (e.g., public support 
for the biobank, an independent governing body, and standardiza¬ 
tion in biobank tools and procedures). However, several areas of 
disagreement remained (e.g., one-time blanket consent, donor 
compensation, and the ownership of biobank biological samples). 
The results of the deliberations were provided to the BC Biobank 
and ongoing dialogue will assist with incorporating the results 
into the review of policies of the biobank and future community 
engagement efforts [13]. 

A similarly ambitious and novel community engagement 
approach was employed with a group of young offenders in South 
Wales [37]. In that ELSI project, a mock jury trial engaged youth 
with ELSI raised by the creation of a National DNA database. Still 
other ELSI research developed a “Genome Diner” community 
engagement approach which brought together scientific experts, 
as well as school children and their parents to deliberate ELSI asso¬ 
ciated with genomic research [38]. Interactive discussions were 
held in school cafeterias, arranged as a “menu”: Appetizers (warm 
up questions), Main Course (specific discussion topics) and Dessert 
(summary of discussions from each table). Participants evaluated 
the program highly, and geneticists in particular demonstrated a 
greater knowledge about and more favorable attitudes toward the 
public’s ability to contribute to genomics policy and discussion. 

In Canada, a deliberative workshop approach was used to 
explore issues raised by the inclusion of genomic profiling in two 
routine health-care contexts—risk assessment for colorectal cancer 
or type 1 diabetes as part of newborn bloodspot screening [39]. 
Workshops lasted for 2-3 h and included three components: an 
information component, a deliberation component, and a data col¬ 
lection component. The information component provided descrip¬ 
tions of the genomic technology of interest, as well as its possible 
implications (both positive and negative), in standard PowerPoint 
format. The deliberation component provided an opportunity for 
questions, discussion and debate about the information presented. 
The data collection component used multiple approaches to capture 
participants’ reactions and attitudes (e.g., free form booklet 
responses, Likert-type attitude items, and group discussion field 
notes). In total, eight workshops (n=170) in two provinces (NL 
and ON) were completed. Results were consistent with existing 
ELSI literature in that attitudes were generally positive, but with 
notable areas of concern. For example, community members were 
concerned about the clinical utility and validity of genomic 
information, access to biobank data, and the potential for negative 
psychosocial effects such as undue worry about uncertain disease-risk 
estimates. 

These ELSI research efforts reveal that the public is open to 
discussion about genomics and personalized medicine and tends to 
rate public engagement efforts on these topics highly. Public information 
studies in particular (e.g., surveys, focus groups, opinion 
polls) reveal that the public has a largely positive attitude about the 
potential for genomics to improve health, and most indicate they 
would participate in biobanks and other large genetic cohort stud¬ 
ies. These ELSI studies also reveal, however, that critical elements 
such as informed consent processes, data protection and sharing 
regulations, biobank and data registry governance, as well as the 
validity and utility of genomic information for medical decision 
making must be addressed to assure the public of the potential 
worth of personalized medicine. They are also instructive in high¬ 
lighting the range of ELSI that have been the focus of much 
genomics research in the last two decades, as well as the range of 
methodologies employed by ELSI researchers. 


4 Practical Considerations When Undertaking an ELSI Initiative 

ELSI research is an inherently complex undertaking, particularly 
for first-time interdisciplinary researchers. Tait and Lyall note that 
“Interdisciplinary research often requires more resources of time, 
effort, imagination, and money than single discipline research 
(and may also involve higher risks of failure) but the rewards can 
be substantial, in terms of advancing the knowledge base and help¬ 
ing to solve complex societal problems” [40]. In this final section 
we outline some practical considerations for those wishing to 
engage in interdisciplinary ELSI research in genomics. 

1. Accept that ELSI research will likely require more time upfront 
and a preliminary research phase that is somewhat open-ended. 
Extra time will be needed to promote the formation of a cohe¬ 
sive research team with the correct disciplines represented 
given the nature of the research question. The ELSI questions 
themselves may take some time to negotiate. Initial discussions 
with team members will involve specifying the parameters on 
research questions and methodological approaches that are rel¬ 
evant. Stokols and colleagues [41] note that teams need to 
develop “shared conceptual frameworks that integrate and 
transcend the multiple disciplinary perspectives represented 
among team members” (p. S97). 

In an integrated ELSI project, it is essential that ELSI 
team members are engaged very early on. This includes the 
integration of ELSI researchers during the funding proposal 
stage as they will be instrumental in drafting portions of the 
proposal to identify specific ELSI that arise from the scientific 
or clinical research. Avoid contacting an ELSI researcher a few 
days before the funding deadline and asking him or her to 
provide a page or two on some of the issues that might arise 
out of the project. Given the broad range of ELSI research 
that might be undertaken with any large scale genomics proj¬ 
ect, the ELSI lead researcher will need time to identify both 
the key ELSI objectives and the kinds of expertise necessary to 
address them. 

2. Choose team members with interdisciplinary characteristics 
such as flexibility, adaptability, creativity, and a real willingness 
to keep an open mind and be curious about ideas from other 
disciplines and backgrounds [40, 41]. It follows that good 
communication and listening skills are essential. 

3. Allow an adequate budget for ELSI research in genomics. This 
might include the standard costs of ELSI research descriptive 
methodologies (e.g., survey administration, transcription of 
interviews), but might also include other ELSI activities that 
are part of the broader normative research plan (e.g., expert 
consensus meetings, educational tool development). An ELSI 
budget like any research budget will depend on the questions 
one hopes to answer, and the outputs to be achieved. As ELSI 
methodologies vary widely, budgets will vary as well. As a rule 
of thumb, however, it is our experience that an integrated 
ELSI budget should comprise approximately 5-8 % of the 
overall budget of the scientific or clinical project. 

4. Schedule regular team meetings that include not only ELSI 
team members, but also members of the broader scientific or 
clinical project. In our experience, monthly (or at least quar¬ 
terly) team meetings provide a relatively informal opportunity 
for team members to interact, which can promote team cohe¬ 
sion and function [41]. Practically, these are useful venues to 
keep all team members informed of the various stages of the 
research project and provide excellent training opportunities 
for students and other trainees. Non-ELSI students become 
exposed early to the idea of integrated ELSI research and learn 
in an active way about the facilitators and barriers to interdisci¬ 
plinary research. In addition, ELSI students will gain some 
familiarity with the process and content of scientific and clini¬ 
cal research. Finally, regular meetings also promote network¬ 
ing opportunities that may be important when putting together 
a team for a future research project. 

5. Discuss publication process and authorship requirements 
up-front. Determine a systematic way to ensure that all and 
only legitimate contributors are listed on each publication and 
that contributions are acknowledged appropriately. Different 
disciplines have different standards and these need to be 
acknowledged and negotiated from the outset. For example, 
multiauthored papers are common in the basic and social sci¬ 
ences, but less so in the humanities. In the basic sciences first 
and last author are generally considered the most significant, 
while in the social sciences the order of authorship is often 
more important; contributors are often arranged in a descend¬ 
ing order. If two or three contributors did the bulk of the work 
while the remaining coauthors contributed equally, the first 
two or three authors are ordered according to their relative 
contributions while the rest would be added in alphabetical 
order. In the humanities, by contrast, multiauthored works 
(i.e., more than two or three co-authors) are rare, although it 
is common to have a lengthy acknowledgement section at the 
end of a paper. Given the importance of a publication record to 
the careers of most academics, sorting out how these proce¬ 
dural details will be managed at the outset will be important. 
Even when authorship criteria are determined it is often advis¬ 
able to have a publication subcommittee that evaluates each 
publication produced in the project before it goes out for 
review. Who should be included as authors and in what order 
their names should appear may depend to some degree on the 
journal in which the team hopes to place the publication. 


5 Conclusion 


Given the continuing emphasis on genomic research in general and 
the increasing pressure to do translational research in particular, 
the question is not whether one should engage in ELSI research, 
but rather when and how to do so most effectively. Our hope is 
that this chapter has provided a useful overview of the nature and 
scope of ELSI research, and some practical pointers on how to 
engage in it successfully. 
Evidence-Based Decision-Making 1: Critical Appraisal 

Laurie K. Twells 

Abstract 

This chapter provides an introduction to the concept of Evidence-based Medicine (EBM), including its 
history, rooted in Canada, and its important role in modern medicine. The chapter both defines EBM and 
explains the process of conducting EBM. It includes a discussion of the hierarchy of evidence that exists 
with reference to common methods used to assess the levels of quality inherent in study designs. The focus 
of the chapter is on how to critically appraise the medical literature, as one step in the EBM process. 
Critical appraisal requires an understanding of the strengths and weaknesses of study design and how these 
in turn impact the validity and applicability of research findings. Strong critical appraisal skills are critical to 
evidence-based decision-making. 

Key words Evidence-based medicine, Critical appraisal, Study design 


1 Introduction 


Evidence-based medicine (EBM) is “the conscientious, explicit 
and judicious use of current best evidence in making decisions 
about the care of individual patients” [1]. It means integrating 
individual clinical expertise with the best available external clinical 
evidence from systematic research [1, 2]. More recently, it has 
been further defined as the integration of best research evidence 
with clinical expertise and patient values [3]. The process of EBM 
involves formulating a clinical question, searching and obtaining 
the best evidence to answer the question, critically appraising the 
evidence to ensure its validity and applicability, and implementing 
the findings in practice [1]. Dr. David Sackett is often regarded as 
the “father of evidence-based medicine” although the term is said 
to have been first used by Dr. Gordon Guyatt in the 1990s [4, 5]. 
EBM is a process that grew out of the need for medical education 
to move away from patient care based solely on “expert opinion” 
to that based on best evidence [3]. Although now just one step in 
the process, it is interesting that EBM grew out of critical 
appraisal—the assessment of the validity of scientific literature 
and its practical relevance to patient care [1]. Critical appraisal of 
the scientific literature was advanced by David Sackett and Brian 
Haynes at McMaster University in the early 1980s when they 
published a series of articles in the Canadian Medical Association 
Journal (CMAJ) entitled “How to read clinical journals” with 
various subtopics that included: the etiology or causation of disease, 
quality of care, the usefulness or harm associated with therapy, 
and the utility of diagnostic tests [6]. Following these articles, 
Sackett wrote the seminal text for students “Clinical Epidemiology: 
A Basic Science for Clinical Medicine” now in its 3rd edition and 
often referred to as “the bible” of EBM [3, 7, 8]. Over the next 
two decades, the CMAJ articles were further refined and led to 
the establishment of an EBM Working Group that subsequently 
developed a series of 25 papers known as the JAMA User’s Guide 
to the Medical Literature. These guides were initially developed 
for clinicians to help them interpret the medical literature and 
support clinical decision-making [9]. The success of this series of 
papers provided the impetus for both the JAMA User’s Guide to 
the Medical Literature, a textbook (in its 6th printing), as well as 
the development of a user-friendly, publicly available website 
that houses numerous resources for supporting the practice of 
EBM (http://www.jamaevidence.com). The articles, text and 
website include a number of EBM resources, structured guides 
on how to appraise papers on topics such as therapy, diagnosis, 
prognosis, quality of care, economic analysis and overviews, and 
are considered by many as the definitive checklists for critical 
appraisal [10]. 


2 The Process of Evidence-Based Medicine 

In the opening editorial of the very first issue of the journal 
Evidence-Based Medicine, the essential steps in this emerging science 
of EBM were summarized. These included: to convert information 
needs into answerable questions (i.e., to formulate the 
problem); to track down, with maximum efficiency, the best evidence 
to answer these questions; to appraise the evidence critically 
in order to assess its validity (or truthfulness) and its applicability (or 
usefulness); to implement the results of the appraisal into clinical 
practice; and to evaluate performance [11, 12]. 

This process is often illustrated using the Steps or “A’s” approach 
shown in Table 1. This chapter is an introduction to Step 4, to 
“Appraise” the medical literature in order to assess its validity and 
applicability. The process of critical appraisal is a very important 
part, albeit one step, in the EBM process due to two key principles. 
First, not all evidence is considered equal, and second, a hierarchy 
of evidence exists linked to its design and inherent methodology. 

Table 1

Step 1    Assess important patient or policy problems
Step 2    Ask well-defined clinical questions from case scenarios, the answer to which
          will inform decision-making
Step 3    Acquire information by selecting and searching the most appropriate resources
Step 4    Appraise the medical literature for its validity (closeness to the truth) and
          its applicability (usefulness in clinical practice)
Step 5    Apply the results of the appraisal of medical literature to make sound, reasoned
          clinical decisions taking into account patient preferences and values
Step 6    Assess or evaluate performance in applying the evidence


Appraising evidence requires an understanding of the strengths 
and weaknesses of epidemiological study design and how these in 
turn affect the validity and applicability of study findings [10]. 


3 Levels of Scientific Evidence 

A number of classification systems have been developed to assess 
and describe the varying levels of evidence associated with different 
study designs. Although there is some debate over the strengths of 
individual study methods, there is a general consensus that a hier¬ 
archy of evidence exists. Various study designs will provide differ¬ 
ing levels of evidence to support a treatment effect or causal 
relationship by limiting systematic bias [3, 8, 10]. This hierarchy of 
evidence is most often illustrated by a pyramid or similar graphic 
that places the types of evidence in the following order of decreas¬ 
ing strength: 

1. Systematic reviews and meta-analyses. 

2. Randomized Controlled Trials. 

3. Cohort studies. 

4. Case-control studies. 

5. Cross-sectional studies. 

6. Case series/Case reports. 

7. Expert opinion. 
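
The ordering above can be written down as a small data structure, which is sometimes handy when sorting retrieved articles before appraisal. The following is only an illustrative sketch: the class name `EvidenceLevel` and the helper `strongest_first` are hypothetical, and the ranks simply mirror the seven positions in the list rather than any formal grading system.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """Study designs ranked as in the list above (1 = strongest evidence)."""
    SYSTEMATIC_REVIEW_OR_META_ANALYSIS = 1
    RANDOMIZED_CONTROLLED_TRIAL = 2
    COHORT_STUDY = 3
    CASE_CONTROL_STUDY = 4
    CROSS_SECTIONAL_STUDY = 5
    CASE_SERIES_OR_CASE_REPORT = 6
    EXPERT_OPINION = 7


def strongest_first(designs):
    """Return the designs ordered from the top of the pyramid downwards."""
    return sorted(designs)


# Hypothetical search result: the designs of three retrieved papers.
retrieved = [
    EvidenceLevel.EXPERT_OPINION,
    EvidenceLevel.COHORT_STUDY,
    EvidenceLevel.RANDOMIZED_CONTROLLED_TRIAL,
]
ranked = strongest_first(retrieved)
print([level.name for level in ranked])
```

The rank reflects study design only; a poorly conducted trial can still provide weaker evidence than a well-conducted cohort study, which is why each study must also be appraised individually.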

A very brief summary of these main study designs is provided 
here. For more detailed information please refer to other chapters 
in this textbook. Epidemiological research studies are divided 
into experimental (intervention) and observational studies. With 
the exception of the randomized controlled trial, the only 
experimental design, most are observational in nature. At the top 
of the pyramid are studies that summarize other studies. 
Systematic reviews (SR) are produced by systematically searching, 
critically appraising, and synthesizing available literature on 
a specific topic (e.g., the difference between parental perception 
and actual weight status of children: a systematic review). An SR 
with meta-analysis includes a quantitative summary of all study 
results, the benefit being an increased power to assess the effectiveness 
(or lack thereof) of an intervention (e.g., the effectiveness 
and risks of bariatric surgery: an updated systematic review and 
meta-analysis). Clearly, the quality of the meta-analysis is dependent 
on the quality of the RCTs included. In some instances a 
high quality RCT will dominate the evidence base. In other 
instances, a meta-analysis will reveal a weak evidence base with 
few trials homogeneous for the intervention, design, patient 
groups and outcomes. An RCT, considered the gold standard in 
study design, is the only study design whereby participants are 
randomly allocated to an intervention/experimental arm (e.g., 
new cancer treatment) or a control arm (e.g., standard of 
care + placebo). Follow-up takes place over time to measure one 
or more outcomes of interest. Within a cohort study, a group of 
individuals exposed to a risk factor (e.g., diabetes mellitus) is 
compared to a similar unexposed group and an outcome(s) (e.g., 
premature mortality) is assessed over a specific time period. 
Cohort studies can be either prospective or retrospective, 
depending on how the data are collected. In a case-control 
study, a group of individuals with a disease/outcome of 
interest (e.g., birth limb defects) is identified and compared to 
a control group with respect to their past exposure status (e.g., 
medication use such as thalidomide). Cross-sectional studies or 
prevalence studies classify subjects according to disease and 
exposure status at a single point in time. Data are often collected 
through health surveys and questionnaires (e.g., a health survey 
reports the prevalence of obesity and diabetes in a target population). 
A case report consists of a detailed report of a single patient while 
a case series provides information on more than one patient with the 
same features (e.g., four young men described with a rare form of 
pneumonia, which led to the discovery of AIDS) [7, 10, 11]. 
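
The gain in power from a meta-analysis mentioned above can be made concrete with a short calculation. The sketch below uses fixed-effect inverse-variance pooling, which is one common pooling method and is chosen here purely for illustration; the function name `pool_fixed_effect` and the three trial results are hypothetical and do not come from any study cited in this chapter.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Fixed-effect inverse-variance pooling of study-level estimates.

    Each study is weighted by 1 / SE^2, so more precise studies count more.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical log risk ratios and standard errors from three small trials.
log_rr = [-0.30, -0.10, -0.25]
se = [0.20, 0.15, 0.25]

pooled, pooled_se = pool_fixed_effect(log_rr, se)
ci_low = pooled - 1.96 * pooled_se
ci_high = pooled + 1.96 * pooled_se
print(f"pooled log RR = {pooled:.3f}, 95% CI {ci_low:.3f} to {ci_high:.3f}")
```

Because the pooled standard error is smaller than that of any single trial, the combined confidence interval is narrower. The calculation does nothing to correct flaws in the included trials, and it assumes the trials estimate a common effect, which is exactly the homogeneity concern raised above.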

3.1 Methods Used to Evaluate Scientific Evidence

There are many examples of methods used by organizations to 
delineate the quality of evidence. Some of these include those 
developed by the US Preventive Services Task Force (USPSTF), 
the Oxford Centre for Evidence-Based Medicine (CEBM), and the 
Grading of Recommendations Assessment, Development and 
Evaluation (GRADE) working group. 
3.1.1 The US Preventive Services Task Force 

Varying levels of evidence are used to rank the effectiveness of 
treatments or screening tools relevant to the primary care environment 
and are classified using the following levels: 

• Level I: Evidence obtained from at least one properly designed 
randomized controlled trial, well-conducted systematic review 
or meta-analysis of homogeneous RCTs. 

• Level II-1: Evidence obtained from well-designed controlled 
trials without randomization. 

• Level II-2: Evidence obtained from well-designed cohort or 
case-control analytic studies. 

• Level II-3: Evidence obtained from multiple time series designs 
with or without the intervention; dramatic results from uncontrolled 
experiments. 

• Level III: Opinions of respected authorities, based on clinical 
experience, descriptive studies, or reports of expert committees. 

Prior to the grading of levels, individual studies are critically 
appraised for internal validity based on specific criteria unique to each 
study design. Ultimately each study will be described as good (if a 
study meets all criteria), fair (if a study does not meet all criteria 
but does not have a fatal flaw) or poor (the study has a fatal flaw) in 
terms of methodological quality. For example, when critically appraising 
an RCT the following descriptors could apply: 

(1) Good: comparable groups were initially recruited 
and maintained throughout the study (follow-up at least 80 %); 
reliable and valid measurement instruments were used and applied 
equally to the groups; interventions were described clearly; all 
important outcomes were reported; confounders were taken into 
consideration; and intention-to-treat (ITT) analysis was conducted. 

(2) Fair: although comparable groups were recruited at the start of 
the study period, questions about differences in follow-up exist; 
measurement instruments are acceptable and have been applied equally but 
may not be the best choice; some but not all important outcomes are 
considered; and some but not all potential confounders are accounted 
for. ITT analysis is conducted. 

(3) Poor: groups recruited at the start of the 
study are not close to being comparable or are not maintained throughout 
the study; unreliable/invalid measurement instruments are used or 
not applied consistently among groups (including not blinding outcome 
assessment); key confounders are not accounted for; and ITT 
analysis is absent (http://www.uspreventiveservicestaskforce.org/uspstf08/methods/procmanual4.htm). 

3.1.2 The Oxford Centre for Evidence-Based Medicine 

The Oxford Centre for Evidence-Based Medicine (CEBM) provides 
a grading system (http://www.cebm.net/) to evaluate evidence 
for different types of questions that include those on therapy, 
etiology, prevention, harm, prognosis, diagnosis, and economic 
analysis. The highest level of evidence is classified as 1a and refers 
to an SR with homogeneity (similar study methods); the lowest 
level of evidence, level 5, is expert opinion. An evaluation of evidence 
using these levels results in a graded recommendation 
(A to D), with a Grade A recommendation indicating that consistent 
level 1 studies are available, through to a Grade D recommendation, 
which indicates that only level 5 evidence is available or that 
the other available evidence is inconclusive. 


3.1.3 The Grading of Recommendations Assessment, Development, and Evaluation 


The GRADE working group (http://www.gradeworkinggroup.com) 
has further refined and developed the process of assessing the 
strength of a study by addressing not just the quality of the 
research but also the impact other factors have on the confidence 
in study results. Similar to other systems, the quality of evidence is 
assessed on four levels (i.e., high, moderate, low, very low), while 
the confidence factor is based on judgments assigned in five different 
domains in a structured manner. For example, an RCT may initially be 
considered a high quality study, but depending on its assessment 
in these domains it may be downgraded due to risk of bias (e.g., 
no allocation concealment), imprecision (i.e., random error), or 
indirectness (e.g., population, interventions or outcomes differ 
from those of interest). A body of evidence may also be 
downgraded due to inconsistency (e.g., different point estimates 
with nonoverlapping confidence intervals) or publication bias 
(e.g., small sample sizes with large treatment effects, commercially 
funded research). Alternatively, an observational study of moderate 
quality could be upgraded due to a large effect size or evidence of 
a dose-response relationship, either of which would further support 
inferences of a treatment effect. 
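
The downgrading and upgrading logic described above can be sketched in a few lines of code. This is a deliberately simplified illustration and not the GRADE working group's procedure: it assumes that randomized trials start at "high" and observational studies at "low", and that each concern or strength moves the rating by exactly one level, whereas in practice each move is a structured judgment. The function name `grade_quality` and the labels used for the domains are hypothetical.

```python
LEVELS = ["very low", "low", "moderate", "high"]

# The five downgrading domains named in the text.
DOWNGRADE_DOMAINS = {
    "risk of bias",
    "imprecision",
    "indirectness",
    "inconsistency",
    "publication bias",
}

# The two upgrading factors named in the text.
UPGRADE_FACTORS = {"large effect", "dose-response"}


def grade_quality(design, concerns=(), strengths=()):
    """Return a quality level after one-step moves per concern or strength.

    design: "rct" or "observational" (assumed starting points: high / low).
    """
    concerns, strengths = set(concerns), set(strengths)
    unknown = (concerns - DOWNGRADE_DOMAINS) | (strengths - UPGRADE_FACTORS)
    if unknown:
        raise ValueError(f"unrecognized factors: {sorted(unknown)}")
    if design not in ("rct", "observational"):
        raise ValueError(f"unrecognized design: {design!r}")

    start = LEVELS.index("high") if design == "rct" else LEVELS.index("low")
    score = start - len(concerns) + len(strengths)
    score = max(0, min(score, len(LEVELS) - 1))  # keep within the four levels
    return LEVELS[score]


# An RCT without allocation concealment and with wide confidence intervals.
print(grade_quality("rct", concerns=["risk of bias", "imprecision"]))
# An observational study showing a large effect.
print(grade_quality("observational", strengths=["large effect"]))
```

A sketch like this is useful mainly for seeing how the domains combine; the substance of a GRADE assessment lies in justifying each judgment, not in the arithmetic.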


4 Critical Appraisal: Basics 

Critical appraisal is the process of systematically assessing the valid¬ 
ity, usefulness, and relevance of the evidence [12]. The process can 
be divided into an examination of extrinsic and intrinsic factors. 
Extrinsic factors include taking note of the authors and their affili¬ 
ations, the journal, the funder, and the stated conflicts of interest 
[13]. Examining the intrinsic factors requires a rigorous assessment 
of study design and methodology, the focus of critical 
appraisal. A number of excellent resources have been developed to 
support the critical appraisal process (see EBM Resources at end of 
chapter), and all use a very similar template that involves asking 
three main questions followed by a subset of specific questions 
associated with a particular type of question (e.g., therapy) or study 
design (e.g., cohort study). These questions include: 
1. Are the results of the study valid?

2. What are the results? 

3. Will the results help in caring for my patients? Are the results 
applicable or generalizable to my patient population? 


For each main question, a number of publicly available EBM
resources (e.g., JAMA Users’ Guides, Clinical Evidence-Based
Medicine (CEBM), Cochrane Collaboration) provide checklists,
templates, and worksheets to help health professionals and
students learn how to effectively appraise the scientific literature
in relation to its validity and applicability. In the section below,
examples of the types of questions that should be addressed during
the appraisal process are provided. This is not an exhaustive list but
an overview of the types of questions you would expect to answer
when appraising an article. References throughout the chapter and
in the reference section provide readers with some of the key
resources that should be used in the process of critical appraisal.


I. Are the results of the study valid? 

The following questions are relevant for the appraisal of all
research studies.

i. Why is the research being conducted? 

a. Is a brief background or context provided as to why 
the study was conducted? 

b. What is the study about? 

ii. What is the research question being addressed? 

a. Is there a hypothesis being tested? 

b. Is the question described in a PICO format? 
(Population, Intervention, Control, Outcome) 

c. If, after I conduct a methodological assessment, the
results are valid, are they applicable to my question,
my patient or patient population? If yes, keep reading;
if no, move to another paper.

iii. What type of study has been conducted? 

a. Primary studies present original research, while
secondary research summarizes or integrates primary
research. A brief descriptor of the main types of
studies and their objective is provided in Table 2.

iv. Was the research study design appropriate to the type of
question?

a. The clinical area and/or type of question will
normally inform the appropriate choice of study
design. Table 3 provides some examples to illustrate
these choices.

For each type of study question and/or study design, 
a set of questions has been developed to assess the 
validity of the study methods. These questions help 
to assess whether selection biases (e.g., the groups 
being compared are different), or information biases 
(e.g., ascertainment of exposure status) exist as well 
as to determine the level of confounding that exists 
and how the authors have chosen to adjust for it. 

v. Do the methods used increase the validity of the results? 
Broad questions for each study design include: 

a. Systematic reviews and/or meta-analyses: search
details, comprehensiveness and rigor of review,
quality assessment, appropriate synthesis of results,
heterogeneity

b. Randomized controlled trials: success of the
randomization process (e.g., evidence of allocation
concealment, equal groups), follow-up of patients,
blinding, statistical analysis (e.g., intention-to-treat
(ITT), per protocol), groups treated equally other than
intervention

c. Cohort studies: recruitment of the cohort (e.g., is
it representative of a defined population), the
measurement of the exposure and outcome (e.g.,
subjective or objective measures), blinding (e.g., of the
assessor), confounding (e.g., restriction, multivariate
modelling, sensitivity analysis), loss to follow-up

d. Case-control studies: recruitment of cases (e.g., case
definition, representative, prevalent vs. incident,
sufficient sample size) and controls (e.g., representative,
sufficient sample, matched), exposure ascertainment

e. Diagnostic studies: reference standard, disease
status (e.g., level of severity), blinding.

II. What are the results? 

i. What are the main results of the study? How are they 
presented? (e.g., Relative Risk, Odds Ratio, Hazard 
Ratio, % change, mean difference, sensitivity, specificity, 
likelihood ratios, Number Needed to Treat). 

ii. Is the analysis appropriate to the study design? 
iii. Are the results statistically significant? (e.g., p-values,
Confidence Interval (CI))

iv. What is the treatment effect? Strength of effect?
a. How precise is it? (e.g., width of CI)

v. Have the results been adjusted for confounding? 
(e.g., crude and adjusted analysis) 

vi. Have drop-outs or lost to follow-up been accounted 
for? (e.g., ITT, per protocol analysis, sensitivity 
analysis) 

vii. Do you believe the results? Could they be due to 
chance, bias or confounding? 

viii. Do the results suggest a causal relationship? 

a. Guidelines have been developed to help assess 
the likelihood of a cause-effect relationship (see 
Assessing Causation below) 

ix. Are you concerned about publication bias?

III. Are the results from the study applicable/relevant to my 
research question, patient or population of interest? 

i. Can the results (or test) be applied to my patient/ 
local population? (e.g., similar socio-demographic, 
health status, gender, age, country, health system) 

a. Are the results statistically significant and/or 
clinically significant? 

ii. Were all relevant outcomes included in the study? 

iii. Do the benefits outweigh the harms (if any)? 
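
Several of the measures listed under "What are the results?" (Relative Risk, Odds Ratio, Number Needed to Treat, and a confidence interval as a guide to precision) can be computed directly from a 2 x 2 table. The following is a minimal sketch, assuming a two-arm study with a binary outcome and no empty cells; the function name effect_measures and the counts in the example are hypothetical and are not taken from any study cited in this chapter.

```python
import math


def effect_measures(a, b, c, d, z=1.96):
    """Effect measures from a 2 x 2 table (assumes no cell is zero).

    a -- treated patients with the event
    b -- treated patients without the event
    c -- control patients with the event
    d -- control patients without the event
    z -- normal quantile for the interval (1.96 gives about 95 %)
    """
    risk_treated = a / (a + b)
    risk_control = c / (c + d)
    relative_risk = risk_treated / risk_control
    odds_ratio = (a * d) / (b * c)
    # Absolute risk reduction: positive when treatment lowers risk.
    arr = risk_control - risk_treated
    # NNT is conventionally rounded up to a whole patient when reported.
    nnt = math.inf if arr == 0 else 1 / abs(arr)
    # Confidence interval for the relative risk, built on the log scale.
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    log_rr = math.log(relative_risk)
    ci = (math.exp(log_rr - z * se_log_rr), math.exp(log_rr + z * se_log_rr))
    return {
        "relative_risk": relative_risk,
        "odds_ratio": odds_ratio,
        "absolute_risk_reduction": arr,
        "number_needed_to_treat": nnt,
        "relative_risk_ci": ci,
    }


# Hypothetical trial: 15/100 events on treatment, 30/100 on control.
for name, value in effect_measures(15, 85, 30, 70).items():
    print(name, value)
```

A narrow interval that excludes 1 speaks to statistical significance and precision; whether the effect is clinically important remains a separate judgment, as question III emphasizes.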


Table 2
Study design and its major objective

Study design                  Major objective

Meta-analysis                 To provide an overall summary statistic of
                              multiple primary studies using an a priori
                              protocol and integration of quantitative data
                              from studies identified by a systematic review

Randomized Controlled Trial   To study the efficacy of a treatment or
                              intervention

Cohort Study                  To study prognosis, natural history of a
                              disease or causation

Case-control Study            To identify potential causal factors for a
                              disease or to study adverse effects

Cross-sectional Studies       To determine the prevalence of disease or
                              risk factors
Table 3
The relationship between clinical area/type of question and research study

Clinical area   Type of question                    Research study

Diagnosis       What disease is responsible         Prospective, blind comparison
                for the abnormal findings?          to a gold standard;
                                                    cross-sectional study

Therapy         What therapy is appropriate         RCT; prospective cohort
                for a disease?

Prognosis       What are the expected outcomes      Longitudinal studies;
                of a disease?                       retrospective/prospective
                                                    cohort studies

Prevention      How can a disease be prevented      RCT; cohort; case-control;
                or delayed?                         case series

Harm            What intervention or other factor   RCT; cohort;
                may be contributing to a disease?   case-control/case series


4.1 Assessing Causation

Knowing what causes a disease or adverse outcome may be critical
for understanding how to prevent, diagnose, treat or provide a
prognosis. According to the Oxford Dictionary, a cause is defined
as “something that gives rise to an action, phenomenon or
condition” [14]. In the mid 1900s, Austin Bradford Hill and Richard
Doll, who were responsible for the seminal studies on smoking and
lung cancer, developed a guide to assess the causal relationship
between an exposure and an outcome [8]. This is not a list of
criteria or rules that have to be met, but a guide to help examine the
strength of the available evidence in the context of a causal
relationship between an exposure and an outcome (Table 4).


5 Concluding Remarks 

The above types of questions and suggested resources will help to 
support critical appraisal of the scientific literature. These are the 
tools needed to systematically assess the validity, usefulness and 
relevance of available evidence. Evidence-based medicine has 
become synonymous with evidence-based health care or
evidence-based practice. Critical appraisal is an important step,
albeit only one step, in this process. Understanding the strengths,
weaknesses and quality
of study designs, and their inherent ability to provide high grade 
evidence for health interventions, will inform evidence-based 
decision-making and evidence-based practice. 
Table 4

Temporality             Exposure precedes disease

Experimental evidence   Evidence from true experiments

Strength                Exposure strongly associated with disease frequency

Biological gradient     More exposure associated with higher disease
or dose-response        frequency or severity

Consistency             The association is observed by different persons in
                        different places during different circumstances

Coherence               The association is consistent with the natural
                        history and epidemiology of the disease

Biologic plausibility   Causation is consistent with biological knowledge
                        of the time

Specificity             One cause leads to one effect

Analogy                 Cause and effect relationship has been established
                        for a similar risk factor or disease
Evidence-Based Decision-Making 2: Systematic
Reviews and Meta-analysis

Aminu Bello, Natasha Wiebe, Amit Garg, and Marcello Tonelli 

Abstract 

The number of studies published in the biomedical literature has dramatically increased over the last few
decades. This massive proliferation of literature makes clinical medicine increasingly complex, and
information from multiple studies is often needed to inform a particular clinical decision. However, available
studies often vary in their design, methodological quality, and populations studied, and may define the research
question of interest quite differently, which can make it challenging to synthesize their conclusions. In
addition, since even highly cited trials may be challenged over time, clinical decision-making requires
ongoing reconciliation of studies which provide different answers to the same question. Because it is often
impractical for readers to track down and review all the primary studies, systematic reviews and
meta-analyses are an important source of evidence on the diagnosis, prognosis, and treatment of any given
disease. This chapter summarizes methods for conducting and reading systematic reviews and meta-analyses,
and describes potential advantages and disadvantages of these publications.

Key words Meta-analysis, Systematic review, Literature synthesis, Random effects, Forest plot 


1 Introduction 


The number of studies published in the biomedical literature has 
dramatically increased over the last few decades—there are now 
over 21 million citations in MEDLINE from 1964 to 2014 with 
nearly 4,000 new citations added daily [1]. These citations are 
from over 5,600 journals worldwide in about 40 languages (~93 % 
published in English) [1]. This massive proliferation of literature 
makes clinical medicine increasingly complex, and information 
from multiple studies is often needed to inform a particular clinical 
decision [2]. However, available studies often vary in their design,
methodological quality, and populations studied, and may define the
research question of interest quite differently, which can make it
challenging to synthesize their conclusions. In addition, since even
highly cited trials may be challenged over time [3], clinical
decision-making requires ongoing reconciliation of studies which
provide different answers to the same question. Because it is often
impractical for readers to track down and review all the primary
studies [4], review articles are an important source of summarized
evidence on the diagnosis, prognosis, and treatment of any given
disease.

Review articles have traditionally been written as “narrative 
reviews” where a content expert provides a personal interpretation 
of available evidence [5-9]. Although potentially useful, a narrative 
review typically uses an implicit process to compile evidence to 
support the statements being made. The reader often cannot tell
which recommendations were based on the author’s unsubstantiated
clinical experience versus published clinical studies, and the
reasons why some studies were given more emphasis than others.
It is possible some narrative reviews preferentially cite evidence 
that reinforces the preconceived opinions of the authors on the 
topic in question. In addition, narrative reviews generally do not 
provide a quantitative summary of the literature. 


2 How Do Systematic Reviews Differ from Narrative Reviews? 

In contrast, a systematic review uses an explicitly defined process to 
comprehensively identify all studies pertaining to a specific focused 
question, appraise the methods of the studies, summarize their 
results, identify reasons for different findings across studies, and 
cite limitations of current knowledge [10-14]. Unlike a narrative 
review, the structured and transparent process used to conduct a 
properly done systematic review allows the reader to gauge the 
quality of the review process and the potential for bias.
Meta-analyses usually combine the aggregate level data reported in each
primary study (point and variance estimate of the summary
measure). On occasion a review team will obtain individual patient data
from each of the primary studies [15-20]. Although some authors
consider a meta-analysis the best possible use of all available data,
others regard the results with skepticism and question whether
they add anything meaningful to scientific knowledge.


3 Why Are Systematic Reviews and Meta-analyses Clinically Relevant? 

There are key advantages of systematic reviews and meta-analyses 
in clinical decision-making as compared to the traditional narrative 
reviews [21-23]: 

1. Providing robust data for clinical decisions: Reading a well- 
conducted systematic review is an efficient method by which to 
learn about all previous studies on a given topic, and why some 
studies may differ from others in their results (a finding referred 
to as heterogeneity among the primary studies). Such evidence
summaries often form the knowledge base used to support 
clinical decisions, evidence-based practice guidelines, economic 
evaluations, and future research agendas. The assessment of 
the expected effect of an intervention or exposure provided by 
a systematic review can be integrated with information about 
other relevant treatment options, patient preferences and 
health care system factors. Reviewing the evidence summary 
allows the reader to establish whether the scientific findings are 
consistent and valid across populations, settings, and treatment 
variations, and whether findings vary significantly by particular 
subgroups. 

2. Minimizing bias: Meta-analyses and systematic reviews
overcome some of the biases and natural variation inherent in
small studies, where results may not be robust against chance
variation, especially for small treatment effects. Further, the
predefined and explicit methodology of a systematic review
includes steps to minimize bias in all parts of the process:
identifying relevant studies, selecting them for inclusion,
summarizing their data, and (for meta-analysis) statistically
combining data across studies.

3. Enhancing generalizability of findings into practice: Systematic
reviews overcome the lack of generalizability inherent in studies
conducted in one particular type of population by including
many trials conducted in varying populations. Reasons for a
difference in study findings can also be explored, which can
yield new insights.

4. Increasing statistical power: Single studies, viewed separately,
may reach inconclusive results due to relatively small sample
size and wide confidence intervals. Statistical power and estimate
precision can be improved with meta-analysis.
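
The gain in precision can be illustrated with a fixed-effect, inverse-variance pooled estimate, one common way of combining aggregate data. The sketch below is an assumption-laden illustration rather than a method prescribed in this chapter: the function name pool_fixed_effect and the three hypothetical trials (log relative risks with their standard errors) are invented for the example.

```python
import math


def pool_fixed_effect(estimates, std_errors, z=1.96):
    """Inverse-variance fixed-effect pooling of study estimates.

    estimates  -- per-study effect estimates (e.g., log relative risks)
    std_errors -- per-study standard errors, same order as estimates
    Returns (pooled estimate, pooled standard error, confidence interval).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1 / total)
    return pooled, pooled_se, (pooled - z * pooled_se, pooled + z * pooled_se)


# Three hypothetical small trials; each 95 % interval includes zero.
log_rr = [-0.30, -0.25, -0.35]
se = [0.20, 0.20, 0.20]

for y, s in zip(log_rr, se):
    print("single study interval:", (y - 1.96 * s, y + 1.96 * s))

pooled, pooled_se, ci = pool_fixed_effect(log_rr, se)
print("pooled estimate:", pooled, "standard error:", pooled_se)
print("pooled interval:", ci)
```

With these invented numbers, each single study is inconclusive on its own, while the pooled interval is narrower and excludes zero. A fixed-effect model assumes that all studies estimate the same underlying effect, which may not hold when results are heterogeneous.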


4 How Are Systematic Reviews Conducted? 

A series of guidelines has been published describing how to report 
systematic reviews on therapy [24], screening or diagnosis [25], 
cost-effectiveness [26], or prognosis [27]. In 1996, the Quality of
Reporting of Meta-Analyses (QUOROM) statement was published
specifically to improve the quality of reporting of meta-analyses
of RCTs [28]. In the year 2000, the Meta-analysis of
Observational Studies in Epidemiology (MOOSE) guidelines were
published for reporting systematic reviews of observational studies
[29]. These statements include a checklist, which describes the
preferred way to report a systematic review/meta-analysis.
More recently, these guidelines have been updated by the publication
of PRISMA (Preferred Reporting Items for Systematic reviews and
Meta-Analyses), which addresses several conceptual and practical
advances in the science of systematic reviews [30].

Conducting a systematic review/meta-analysis involves a number
of steps that start with protocol development and research
question formulation, design and study selection criteria, followed by
retrieval of potentially relevant studies, selection of those studies to
be included and evaluation of study risk of bias [21, 31, 32]
(Table 1). Thereafter, the actual meta-analysis is performed and
the primary studies are evaluated for heterogeneity (qualitative and
quantitative). Finally, the results are evaluated for reproducibility
(sensitivity testing) to ensure that bias did not influence the result,
and the implications for practice and/or policy are considered
[12, 33, 34] (Table 1).


Table 1
Steps to undertake a systematic review

Step                     Description

Defining a research      The problems to be addressed by the review should be
question                 identified, with the objectives of the review clearly
                         stated. A prospective protocol defines the
                         populations, inclusion/exclusion criteria,
                         interventions, study designs, and outcomes

Literature search        The published and unpublished literature should be
                         carefully searched for relevant studies required to
                         answer the research question of interest. A
                         professional librarian should help to design the
                         search, where possible

Study selection          Once all possible studies have been identified, they
                         should be independently assessed by two reviewers for
                         eligibility (against inclusion criteria) with
                         retrieval of full text papers for those that met the
                         inclusion criteria (or for which eligibility cannot
                         be initially assessed). Eligible articles should be
                         processed for methodological quality (using a
                         critical appraisal framework)

Data abstraction         Of the remaining studies, relevant characteristics
                         and results should be abstracted onto a data
                         abstraction form. Some studies will be excluded even
                         at this late stage. A list of included studies should
                         then be created

Synthesis, exploration   The findings from the individual studies should be
for heterogeneity,       aggregated, synthesized, and reported, all according
and reporting            to the initially proposed protocol. Deviations from
of the results           or additions to the protocol should be clearly noted
                         and mentioned in the report

Placing the findings     The findings from the evidence synthesis should then
in context               be put into the context of the existing literature.
                         This will address issues such as the quality and
                         heterogeneity of the included studies, the likely
                         impact of bias, as well as the applicability of the
                         findings to practitioners

Reproduced with permission from ref. [32]
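
The quantitative exploration of heterogeneity mentioned in the synthesis step of Table 1 can be sketched as follows. The chapter does not name specific statistics, so the choices here are assumptions: Cochran's Q, the I-squared statistic, and a DerSimonian-Laird random-effects estimate are widely used options, and the function names and example data are hypothetical.

```python
import math


def heterogeneity(estimates, std_errors):
    """Return Cochran's Q, its degrees of freedom, I-squared (%), and tau-squared.

    tau-squared is the DerSimonian-Laird estimate of between-study variance.
    """
    weights = [1 / se ** 2 for se in std_errors]
    total = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / total
    # Q: weighted squared deviations of each study from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of total variation attributed to heterogeneity.
    i_squared = 100 * max(0.0, (q - df) / q) if q > 0 else 0.0
    c = total - sum(w ** 2 for w in weights) / total
    tau_squared = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, df, i_squared, tau_squared


def pool_random_effects(estimates, std_errors, z=1.96):
    """Random-effects pooled estimate using DerSimonian-Laird weights."""
    _, _, _, tau_squared = heterogeneity(estimates, std_errors)
    # Between-study variance is added to each study's own variance.
    weights = [1 / (se ** 2 + tau_squared) for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1 / total)
    return pooled, pooled_se, (pooled - z * pooled_se, pooled + z * pooled_se)


# Two hypothetical, precisely estimated studies that clearly disagree.
effects = [0.0, 1.0]
ses = [0.1, 0.1]
print("Q, df, I-squared, tau-squared:", heterogeneity(effects, ses))
print("random-effects result:", pool_random_effects(effects, ses))
```

When studies disagree by more than chance would explain, the random-effects interval widens to reflect the between-study variance. Exploring the reasons for the disagreement usually matters more than the pooled number itself.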
5 How Should the Quality of a Systematic Review or Meta-analysis
Be Appraised?


Users of systematic reviews need to assure themselves that the
underlying methods are sound. Before considering the results, or
how the information can be appropriately applied in patient care,
there are a few questions that readers can ask themselves when
assessing the methodological quality of a systematic review [10]
(Table 2).


5.1 Was the Review Conducted According to a Prespecified Protocol?


It is reassuring if a review was guided by a written protocol
(prepared in advance) which describes the research question(s),
hypotheses, review methodology, and plan for how the data will be
extracted and compiled. Such an approach minimizes the
likelihood that the results or the expectations of the reviewing team
influenced study inclusion or synthesis. Although most systematic
reviews are conducted retrospectively, reviews and meta-analyses
can in theory be defined at the time several similar trials are being
planned or under way. This allows a set of specific hypotheses, data
collection procedures, and analytic strategies to be specified in
advance before any of the results from the primary studies are
known. Such a prospective effort may provide more reliable
answers to medically relevant questions than the traditional
retrospective approach [35].


5.2 Was the Question Focused and Well Formulated?

Clinical questions often deal with issues of treatment, etiology,
prognosis, and diagnosis. A well-formulated question usually
specifies the patient’s problem or diagnosis, the intervention or
exposure of interest, as well as any comparison group (if relevant),
and the primary and secondary outcomes of interest [36].


Table 2
Assessing the methodological quality of a systematic review

1. Was the review conducted according to a prespecified protocol?

2. Was the question focused and well formulated?

3. Were the right types of studies eligible for the review?

4. Was the method of identifying all relevant information comprehensive?

(a) Is it likely that relevant studies were missed?

(b) Was publication bias considered?

5. Was the data abstraction from each study appropriate?

(a) Were the methods used in each primary study appraised?

6. Was the information synthesized and summarized appropriately?

(a) If the results were mathematically combined in meta-analysis, were the methods described in
sufficient detail and was it reasonable to do so?

Reproduced with permission from ref. [32]

Adapted from Oxman AD, Cook DJ, Guyatt GH. Users’ Guides to Evidence-based Medicine, How to Use an Overview [10]
5.3 Were the “Right” Types of Studies Eligible for the Review?


Different study designs can be used to answer different research
questions. Randomized controlled trials, observational studies,
and cross sectional diagnostic studies may each be appropriate
depending on the primary question posed in the review. When
examining the eligibility criteria for study inclusion, the reader
should feel confident that a potential bias in the selection of studies
was avoided. Specifically, the reader should ask whether the eligibility
criteria for study inclusion were appropriate for the question asked.
Whether the best types of studies were selected for the review also
depends on the depth and breadth of the underlying literature
search. For example, some review teams will only consider studies
published in English. There is evidence that journals from certain
countries publish a higher proportion of positive trials than others
[37]. Excluding non-English studies appeared to change the results
of some reviews, but not others [38-40]. Some review teams use
broad criteria for their inclusion of primary studies (e.g., effects of
agents which block the renin-angiotensin system on adverse
cardiovascular outcomes), while other teams use narrower
inclusion criteria (e.g., restricting the analysis to only those patients
with diabetes with kidney failure) [41]. There is often no single
correct approach. However, the conclusions of any meta-analysis
which is highly sensitive to altering the entry criteria of included
studies should be interpreted with some caution [42]. For
example, two different review teams considered whether synthetic
dialysis membranes resulted in better clinical outcomes compared to
cellulose based membranes in patients with acute kidney injury. In
one meta-analysis [43], but not the other [44], synthetic
membranes reduced the chance of death. The discordant results were
due to the inclusion of a study which did not meet eligibility for
the second review [45].


5.4 Was the Method of Identifying All Relevant Information Comprehensive?


Identifying relevant studies for a given clinical question amongst 
the many potential sources of information is usually a laborious 
process [46]. Biomedical journals are the most common source of 
information, and bibliographic databases are used to search for 
relevant articles. MEDLINE currently indexes about 5,600 medical 
journals and contains 21 million citations [1]. As a supplementary 
method of identifying information, searching databases such as the
Science Citation Index (which identifies all papers which cite a
relevant article), as well as newer Internet search engines like Google
Scholar and Elsevier’s Scirus can be useful for identifying articles
not indexed well in traditional bibliographic databases [47].
Searching bibliographies of retrieved articles can also identify 
relevant articles which were missed. Whatever bibliographic database 
was used, the review team should have employed a search strategy 
which maximized the identification of relevant articles [48, 49]. 
Because there is some subjectivity in screening databases, citations
should be reviewed independently and in duplicate by two
members of the reviewing team, with the full text article retrieved
for any citation deemed relevant by any of the reviewers. There is
also some subjectivity in assessing the eligibility of each full text
article, and the risk that relevant reports were discarded is reduced
if two reviewers independently perform each assessment [50].

Important sources of information other than journal articles
should also be considered. Conference proceedings, abstracts,
books, as well as inquiries to relevant industrial organizations can
all yield potentially valuable information. Inquiries to experts, and
searches of protocols listed in trial registries, may also prove
useful [51]. A comprehensive search of available literature reduces
the possibility of publication bias, which occurs when studies with
statistically significant results are more likely to be published and
cited [52, 53]. It is interesting that some reviews of N-acetylcysteine
for the prevention of contrast nephropathy analyzed as few as
five studies, despite being submitted for publication almost 1 year
after publication of a review of 12 studies [54]. While there are
many potential reasons for this, one cannot exclude the possibility
that some search strategies missed eligible trials. In addition to a
comprehensive search method which makes it unlikely that
relevant studies were missed, it is often reassuring if the review team
used graphical (funnel plot) and statistical methods (Begg test;
Egger test) to confirm there is little chance that publication bias
influenced the results [39].


5.5 Was the Data Abstraction from Each Study Appropriate?


In compiling relevant information the review team should have
used a rigorous and reproducible method of abstracting all relevant
data from the primary studies. Often two reviewers abstract key
information from each primary study including study and patient
characteristics, setting, and details about the intervention,
exposure or diagnostic test as is appropriate. Language translators may
be needed. Teams who conduct their review with due rigor will
indicate they contacted the primary authors from each of the
primary studies, to confirm the accuracy of abstracted data as well
as to provide additional relevant information not provided in the
primary report. Some authors will go through the additional effort
of blinding or masking the results from other study characteristics,
so that data abstraction is as objective as possible [55, 56].

Data on the methodological risk of bias of each primary study 
should always be extracted (recognizing this is not always as 
straightforward as it may first seem) [57-62]. The question to be 
posed by the reader is whether the reviewing team considered if 
each of the primary studies was designed, conducted, and analyzed 
in a way to minimize or avoid biases in the results. For randomized 
controlled trials, lack of concealment of allocation, inadequate 
generation of the allocation sequence, and lack of double blinding 
can exaggerate estimates of the treatment effect [61, 63]. The 
value of abstracting such data is that it may help explain important 
differences in the results amongst the primary studies [61]. For
example, long-term risk estimates can become unreliable when
participants are lost to study follow-up, because those who
participate in follow-up often systematically differ from
non-participants. For this reason, prognosis studies are vulnerable to
bias, unless the loss to follow-up is less than 20 % [64].


5.6 How Was the Information Synthesized and Summarized?


Several types of figures are commonly used to summarize informa¬ 
tion in a systematic review: a flow diagram of study selection 
(Fig. 1) [65], a forest plot depicting individual and most often an 
overall pooled estimate of effect (Fig. 2) [33, 66], and a funnel 



Fig. 1 Example of a PRISMA flowchart for study selection. X represents the number of studies in each category. 
Reproduced with permission from ref. [32] 
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Study or sub-category 

Subgroup 1 
Sub- Subgroup 1 
Study 1 
Study 2 

Sub- Subgroup 2 
Study 3 
Study 4 
Study 5 

Subtotal 

Subgroup 2 
Sub- Subgroup 1 
Study 6 
Study 7 
Study 8 
Study 9 
Study 10 

Sub- Subgroup 2 
Study 11 
Study 12 
Study 13 

Subtotal 

Total 


N 


n 


n 


Effect [95% O] Effect [95%CI] 



-20 -10 0 10 20 


Fig. 2 Example of a forest plot. Each row of a forest plot (also called meta-graph) represents information pulled 
from one study or the total or a subgroup of pooled studies. The marker represents the point estimate of effect. 
The width of the bar represents the 95 % confidence limits. A diamond marker usually indicates the total or a 
subgroup total of pooled results. In this example Y is the point estimate of effect for each trial, and X, Z are the 
95 % confidence limits. To the left of the plot is tabular information for each study, often the sample size of 
each group or study and the numerical point and interval estimates. Reproduced with permission from ref. [32] 


plot showing an assessment of publication bias (Fig. 3). Other 
figures such as a meta-regression plot (Fig. 4) and a network meta¬ 
analysis plot (Fig. 5) are more complex and not commonly used. 
A forest plot contains the individual study point estimates and their 
associated 95 % confidence intervals. Confidence limits may also 
include differences too small to be clinically important [67] and 
should be deemed as neither evidence of an “important effect” nor 
evidence of “no difference” in effect. The forest plot also allows 
one to appreciate the heterogeneity of results, allowing for a visual 
comparison of the point estimates and 95 % confidence intervals 
for the effect of each study, and the overall pooled result. A funnel 
plot is a simple scatter plot of each study’s precision (inversion of 
















T 


T 


T 


Effect estimate 

Fig. 3 Example of a funnel plot. Each study’s precision (the inverse of the 
standard error of each study’s effect estimate) is plotted against each study’s 
effect estimate. These markers are sized according to the study’s sample size; 
larger studies are marked with larger circles. A vertical line is drawn through our 
overall pooled estimate of effect to aid the eye in detecting symmetry (an inverted 
funnel) or asymmetry. This funnel plot appears mildly asymmetric. The emptier 
right side of the inverted funnel may indicate small missing studies. Reproduced 
with permission from ref. [32] 



Fig. 4 Example of a meta-regression plot. Each primary study’s estimate of effect is plotted against a variable 
that may potentially modify the relationship between outcome and intervention (or exposure). The markers 
(circles) are sized according to precision—the inverse of the standard error of each study’s effect estimate. 
The three lines are the fitted (solid) and the upper and lower bounds (dashed) of the 95 % confidence intervals. 
Reproduced with permission from ref. [32] 

Fig. 5 Example of a network meta-analysis figure. Reproduced with permission from ref. [32] 


standard error) on the y-axis against each study’s effect on the 
x-axis. Because small studies have less precision and large studies 
have more, the scatter should form an inverted funnel when no 
studies are systematically missing. A line is often drawn through the 
overall pooled effect to aid the eye in detecting symmetry (an 
inverted funnel) or asymmetry. Asymmetry suggests missing evidence, 
often small unpublished studies. 
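
As a rough illustration of how the points of a funnel plot are obtained, the following Python sketch pairs each study's effect estimate with its precision, taken as the inverse of the standard error. The function name `funnel_points` and the study values are hypothetical and serve only to show the arithmetic.

```python
def funnel_points(effects, standard_errors):
    """Pair each study's effect estimate with its precision (1 / standard error)."""
    if len(effects) != len(standard_errors):
        raise ValueError("effects and standard_errors must have the same length")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")
    return [(effect, 1.0 / se) for effect, se in zip(effects, standard_errors)]


# Hypothetical effect estimates and standard errors for four studies.
effects = [0.10, 0.35, 0.20, 0.60]
standard_errors = [0.05, 0.20, 0.10, 0.40]

# x = effect estimate, y = precision; more precise studies sit higher on the plot.
for effect, precision in funnel_points(effects, standard_errors):
    print(f"effect = {effect:.2f}, precision = {precision:.1f}")
```

Plotting these pairs, with a vertical line at the pooled estimate, gives the scatter whose symmetry the reader is asked to judge.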

The meta-regression plot is not widely used (Fig. 4). Because 
the unit of analysis in this form of regression is the study rather 
than the participant or patient, the analysis is typically underpowered 
and statistical significance is rare. The meta-regression plot is 
both a scatter and line plot. Each study’s estimate of effect is plotted 
against the value of the potential modifier and a regression line 
is drawn through the scatter of observations. A nonzero slope suggests 
an association between the potential modifier and the effect estimate, 
and its sign indicates the direction. In publications of network 
meta-analyses, due to multiple intervention groups and the complexity 
therein, a figure depicting the number of comparisons between 
each set of interventions is usually provided (Fig. 5). Often a matrix 
of direct and mixed evidence for each comparison is reported, 
rather than a forest plot showing the pooled and each individual 
study’s estimate. The matrix, a square table with the intervention 
labels running along the center diagonal, allows one to appreciate 
what direct evidence is absent and where the mixed evidence does 
not agree with the direct evidence (Fig. 6). The row-column cell of 

[Figure 6: square matrix with the labels Intervention 1 to Intervention 6 along the diagonal; each off-diagonal cell reports Y (X, Z).]


Fig. 6 Example of a direct and mixed evidence matrix. The mixed evidence appears in the shaded upper 
triangle and the corresponding direct evidence appears in the unshaded lower triangle. A dash indicates no 
available direct evidence (meaning that trials specifically comparing this pair of treatments were not identified 
in the review). In this example Y is the point estimate of effect for each trial, and X, Z are the 95 % confidence 
limits. Reproduced with permission from ref. [32] 

the available direct evidence corresponds to the column-row cell of 
the mixed evidence. A good understanding of the common statistical 
terms used in meta-analyses is also important (Table 3). 
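
The regression line of the meta-regression plot described above can be sketched as a weighted least-squares fit in which each study is one observation. The Python example below is a minimal sketch under stated assumptions: it weights studies by the inverse of their variance and ignores between-study variance, which is only one of several possible choices and is not necessarily the method behind Fig. 4. The function name and all data values are hypothetical.

```python
def weighted_meta_regression(modifier, effects, standard_errors):
    """Fit effect = intercept + slope * modifier by inverse-variance weighted least squares."""
    if not (len(modifier) == len(effects) == len(standard_errors)):
        raise ValueError("all inputs must have the same length")
    # Assumption: each study is weighted by 1 / SE^2 (more precise studies count more).
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, modifier)) / total_weight
    y_mean = sum(w * y for w, y in zip(weights, effects)) / total_weight
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, modifier))
    if sxx == 0:
        raise ValueError("the modifier must vary across studies")
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, modifier, effects))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: mean participant age per study (the potential modifier),
# each study's effect estimate, and its standard error.
mean_age = [45.0, 52.0, 60.0, 68.0]
effects = [0.10, 0.18, 0.25, 0.35]
standard_errors = [0.08, 0.05, 0.10, 0.06]

intercept, slope = weighted_meta_regression(mean_age, effects, standard_errors)
print(f"intercept = {intercept:.3f}, slope per year = {slope:.4f}")
```

With only four studies as observations, a fit like this illustrates why such analyses are typically underpowered.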

However, these measures are not always required in cases 
where the primary studies are very heterogeneous (differ in the 
design, populations studied, interventions and comparisons used, 
or outcomes measured). In those situations, it may be more appropriate 
to simply report the results descriptively using text and 
tables. When the primary studies are similar in characteristics, and 
the studies provide a similar estimate of a true effect, then meta-analysis 
may have been used to derive a more precise estimate of 
this effect [68]. In meta-analysis, data from the individual studies 
are not simply combined as if they were from a single study. Rather, 
greater weights are given to the results from studies that provide 
more information, because they are likely to better reflect the true 
effect of interest. Quantitatively, the calculation of a summary 
effect estimate can be accomplished under the assumption of a 
“fixed” effects or “random” effects model. Although a thorough 
description of the merits of each approach is provided elsewhere 
[69], it is fair to say that a random effects model is more conservative 
than the fixed effects approach, and that a finding which is 














[Table 3: Common statistical terms used in meta-analyses]







statistically significant with the latter but not the former should be 
viewed with skepticism. Whenever individual studies were pooled 
in meta-analysis, it is important for the reader to determine whether 
it was reasonable to do so. One way of determining whether the 
results are similar enough to pool across studies is to inspect the 
graphical display of the results. 
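
To make the weighting idea concrete, the following Python sketch pools hypothetical study results under a fixed effects and a random effects model. It assumes inverse-variance weights and the DerSimonian-Laird estimate of between-study variance, which are common choices but are not named in this chapter; the function names and data are illustrative only.

```python
import math


def pool_fixed(effects, standard_errors):
    """Inverse-variance fixed effects pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


def pool_random(effects, standard_errors):
    """Random effects pooled estimate (DerSimonian-Laird), its SE, and tau^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    fixed_estimate, _ = pool_fixed(effects, standard_errors)
    # Q measures how far the studies scatter around the fixed effects estimate.
    q = sum(w * (y - fixed_estimate) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w ** 2 for w in weights) / total
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in standard_errors]
    re_total = sum(re_weights)
    estimate = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    return estimate, math.sqrt(1.0 / re_total), tau2


# Hypothetical effect estimates and standard errors for three studies.
effects = [0.2, 0.5, 0.8]
standard_errors = [0.1, 0.1, 0.1]

fe, fe_se = pool_fixed(effects, standard_errors)
re, re_se, tau2 = pool_random(effects, standard_errors)
print(f"fixed:  {fe:.3f} (95% CI {fe - 1.96 * fe_se:.3f} to {fe + 1.96 * fe_se:.3f})")
print(f"random: {re:.3f} (95% CI {re - 1.96 * re_se:.3f} to {re + 1.96 * re_se:.3f})")
```

When the studies disagree, the random effects interval is wider, which is the sense in which that model is described as more conservative.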

Some review teams may also report a statistical test for heterogeneity 
[70], to help prove that primary study results were no 
different than what would have been expected through statistical 
sampling. The most commonly used technique for quantification 
of heterogeneity is the Q statistic, which is a variant of the chi-square 
test. Although a nonsignificant result (by convention a 
p > 0.1) is often taken to indicate that substantial heterogeneity is 
not present, this test is statistically underpowered, especially when 
the number of studies being pooled is small. The magnitude of the 
heterogeneity can be quantified with a new statistic referred to as 
the I², which describes the percentage of variability beyond that 
expected by statistical sampling. Values of 0-30 %, 31-50 %, and 
greater than 50 % represent mild, moderate, and notable heterogeneity, 
respectively [71]. As mentioned above, a careful exploration 
of the sources of heterogeneity can lead to new insights about 
mechanism or subgroups for future study. 
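
The two heterogeneity measures above can be computed directly from the study estimates and their standard errors. The Python sketch below is a minimal example with hypothetical data: it computes the Q statistic around the inverse-variance pooled estimate and derives I² as the share of Q in excess of its degrees of freedom. The p-value for Q is not computed, and the function names are illustrative.

```python
def heterogeneity(effects, standard_errors):
    """Return the Q statistic, its degrees of freedom, and I-squared (percent)."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: percentage of variability beyond that expected from sampling.
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i_squared


def describe(i_squared):
    """Label I-squared using the bands quoted in the text.

    Assumption: values between 30 and 31 are treated as moderate.
    """
    if i_squared <= 30:
        return "mild"
    if i_squared <= 50:
        return "moderate"
    return "notable"


# Hypothetical effect estimates and standard errors for three studies.
effects = [0.2, 0.5, 0.8]
standard_errors = [0.1, 0.1, 0.1]

q, df, i2 = heterogeneity(effects, standard_errors)
print(f"Q = {q:.1f} on {df} df, I-squared = {i2:.1f} % ({describe(i2)})")
```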


6 What Are the Strengths of Systematic Reviews and Meta-analyses? 

Physicians make better clinical decisions when they understand the 
circumstances and preferences of their patients, and combine their 
personal experience with clinical evidence underlying the available 
options [72]. The public and professional organizations (such as 
medical licensing boards) also expect that physicians will integrate 
research findings into their practice in a timely way. Thus, sound 
clinical or health policy decisions are facilitated by reviewing the 
available evidence (and its limitations), understanding reasons why 
some studies differ in their results (a finding sometimes referred to 
as heterogeneity amongst the primary studies), coming up with an 
assessment of the expected effect of an intervention or exposure 
(for questions of therapy or etiology), and then integrating the 
new information with other relevant treatment, patient and health 
care system factors. Therefore, reading a properly conducted 
systematic review is an efficient way of becoming familiar with the 
best available research evidence for a particular clinical question. 

In cases where the review team has obtained unpublished 
information from the primary authors, a systematic review can also 
extend the available literature. The presented summary allows the 
reader to take into account a whole range of relevant findings from 
research on a particular topic. The process can also establish 
whether the scientific findings are consistent and generalizable 
across populations, settings, and treatment variations, and whether 
findings vary significantly by particular subgroups. Again, the real 
strength of a systematic review lies in the transparency of each 
phase of the synthesis process, allowing the reader to focus on the 
merits of each decision made in compiling the information, rather 
than a simple contrast of one study with another as in other types 
of reviews. 

A well-conducted systematic review attempts to reduce the 
possibility of bias in the method of identifying and selecting studies 
for review, by using a comprehensive search strategy, and specifying 
inclusion criteria which ideally have not been influenced by knowledge 
of the primary studies. If this is not done, bias can result. For 
example, studies demonstrating a significant effect of treatment are 
more likely to be published than studies with negative findings, are 
more likely to be published in English, and are more likely to be 
cited by others [42, 73-76]. Therefore, systematic reviews with 
cursory search strategies (or those restricted to the English language) 
may be more likely to report large effect sizes associated 
with treatment. 

Mathematically combining data from a series of well-conducted 
primary studies may provide a more precise estimate of the underlying 
“true effect” than any individual study [51]. In other words, 
by combining the samples of the individual studies, the overall 
sample size is increased, enhancing the statistical power of the analysis 
and reducing the size of the confidence interval for the point 
estimate of the effect. Sometimes, if the treatment effect in small 
trials shows a nonsignificant trend towards efficacy, pooling the 
results may establish the benefits of therapy. For example, ten trials 
examined whether angiotensin converting enzyme (ACE) inhibitors 
were more effective than other antihypertensive agents for the 
prevention of nondiabetic kidney failure [77]. Although some of 
the individual trials had nonsignificant results, the overall pooled 
estimate was more precise, and established that ACE inhibitors are 
beneficial for preventing progression of kidney disease in the target 
population. For this reason, a meta-analysis of well-conducted randomized 
controlled trials is often considered the strongest level of 
evidence [78]. Alternatively, when the existing studies have important 
scientific and methodological limitations including smaller 
sized samples, the systematic review may identify where gaps exist 
in the available literature. In this case an exploratory meta-analysis 
can provide a plausible estimate of effect that can be tested in subsequent 
studies [79, 80]. Ultimately, the effect estimates obtained 
from systematic reviews are more likely to prove robust in larger 
multicenter randomized controlled trials than other forms of medical 
literature, including animal experiments, observational studies, 
and single randomized trials [81, 82]. 


7 What Are the Limitations of Systematic Reviews and Meta-analyses? 

Although well-done systematic reviews and meta-analyses have 
important strengths, they also have potential limitations. First, the 
summary provided in a systematic review and meta-analysis of 
the literature is only as reliable as the methods used to estimate the 
effect in each of the primary studies. In other words, conducting a 
meta-analysis does not overcome problems inherent in the design 
and execution of the primary studies. Meta-analysis also does not 
correct biases due to selective publication, where studies reporting 
dramatic effects are more likely to be identified, summarized, and 
subsequently pooled in meta-analysis than studies reporting smaller 
effect sizes. Since more than three quarters of meta-analyses did 
not report any empirical assessment of publication bias, the true 
frequency of this form of bias is unknown [83]. 

Controversies also arise about the interpretation of summarized 
results, particularly when the results of discordant studies are 
pooled in meta-analysis [84]. The review process inevitably identifies 
studies that are diverse in their design, methodological quality, 
specific interventions used, and types of patients studied. There is 
often some subjectivity when deciding how similar studies must be 
before pooling is appropriate. Combining studies of poor quality 
with those which were more rigorously conducted may not be 
useful, and can lead to worse estimates of the underlying truth, or 
a false sense of precision around the truth [84]. A false sense of 
precision may also arise when various subgroups of patients defined 
by characteristics such as their age or sex differ in their observed 
response. In such cases reporting an aggregate pooled effect might 
be misleading, if there are important reasons to explain this heterogeneity 
[84-87]. 

Finally, simply describing a manuscript as a “systematic review” 
or “meta-analysis” does not guarantee that the review was conducted 
or reported with due rigor [32, 35]. Important methodological 
flaws of systematic reviews published in peer reviewed 
journals have been well described. The most common flaws are 
failure to assess the methodological risk of bias of included primary 
studies, and failure to avoid bias in study inclusion. In some cases, 
industry supported reviews of drugs have expressed fewer reservations 
about methodological limitations of the included trials than 
rigorously conducted Cochrane reviews on the same topic. 
However, the hypothesis that less rigorous reviews more often 
report positive conclusions than good quality reviews of the same 
topic has not been borne out in empirical assessment. Nonetheless, 
like all good consumers, users of systematic reviews should carefully 
consider the quality of the product, and adhere to the dictum 
“caveat emptor”: let the buyer beware. These limitations may 
explain differences in the results of meta-analyses as compared to 
subsequent large randomized controlled trials, which have occurred 
in about a third of cases [82]. 


8 Summary 

Like all research, systematic reviews and meta-analyses have both 
strengths and weaknesses. When performed with methodological 
rigor, they enhance our understanding of the information available 
from all relevant studies on a given topic. Many of the perceived 
limitations of meta-analysis are not inherent to the methodology, 
but represent deficits in the conduct or reporting of individual 
primary studies. With the massive proliferation of clinical studies 
and limited time available for users to fathom the literature, systematic 
reviews will help to guide clinical decision-making and 
policy. To maximize their potential advantages, it is essential that 
future reviews be conducted and reported properly, and be interpreted 
judiciously by 


Evidence-Based Decision-Making 3: Health Technology Assessment 

Daria O’Reilly, Kaitryn Campbell, Meredith Vanstone, James M. Bowen, 
Lisa Schwartz, Nazila Assasi, and Ron Goeree 

Abstract 

This chapter begins with a brief introduction to health technology assessment (HTA). HTA is concerned 
with the systematic evaluation of the consequences of the adoption and use of new health technologies and 
improving the evidence on existing technologies. The objective of mainstream HTA is to support evidence-based 
decision- and policy-making that encourages the uptake of efficient and effective health care technologies. 
This chapter provides a basic framework for conducting an HTA as well as some fundamental 
concepts and challenges in assessing health technologies. A case study of the assessment of drug eluting 
stents in Ontario is presented to illustrate the HTA process. Whether HTA is beneficial—supporting timely 
access to needed technologies—or detrimental depends on three critical issues: when the assessment is 
performed; how it is performed; and how the findings are used. 

Key words Health technology assessment, Health care technology, Economic evaluation, Evidence-based 
decision-making, Health policy 


1 Introduction 


Health care technologies can be described as interventions or 
methods that are used to promote health; prevent, diagnose, or 
treat disease; or improve rehabilitation [1]. Health care technologies 
include drugs, biologics, medical devices (e.g., pacemakers), 
medical and surgical procedures, organizational and managerial 
systems (e.g., alternative health care delivery methods), and public 
health programs [1, 2]. Considerable growth in new, innovative 
health care technologies in recent years has brought 
remarkable improvements in health gains, quality of life, and the 
organization and delivery of health care. Some health care technologies 
have the potential to transform health care and alter 
established ways of delivering health care while improving health 
outcomes in an efficient and cost-effective manner. For example, the 
introduction of angioplasty has resulted in improvements in heart 
attack survival over open-heart surgery. These new health care 
technologies often come with a high price tag. As a result, there is 
a challenge of investing in those services that offer the best value 
for money. Decision-makers must find a balance between providing 
access to high-quality care and improving health outcomes on 
the one hand and managing health care budgets on the other. 

The difficult decisions surrounding adoption, reimbursement, 
and means of diffusion have resulted in the increased demand for 
information to help make more evidence-based decisions [2]. The 
development and proliferation of many health technology assessment 
(HTA) producers around the world reflects this demand. 


1.1 What Is “Health Technology Assessment”? 

HTA means different things to different people in different parts 
of the world and has been defined and conducted in a variety of 
ways; thus, it is not possible to provide one clear and comprehensive 
definition [3]. However, the International HTA Glossary 
defines HTA as “the systematic evaluation of the properties 
and effects of a health technology, addressing the direct and 
intended effects of this technology, as well as its indirect and unintended 
consequences, and aimed mainly at informing decision 
making regarding health technologies” [1]. HTA is conducted by 
interdisciplinary groups using explicit analytical frameworks and 
may involve the investigation of one or more of the following attributes 
of technologies: performance characteristics, safety, clinical 
efficacy, effectiveness, cost-effectiveness, and social, legal, ethical, and 
economic impacts [2, 4]. The main purpose of HTA is to act as “a 
bridge” between evidence and policy-making. It seeks to provide 
health policy-makers with accessible, useable, and evidence-based 
information to guide their decisions about the appropriate use of 
new and existing technologies and the efficient allocation of 
resources [5, 6]. During an assessment, data from research studies 
and other scientific sources are systematically gathered, analyzed, 
and interpreted. The findings from this process are then summarized 
in reports that translate scientific data into information that is 
relevant to decision-making. HTA has increasingly emerged as a 
tool for informing more effective regulation of the utilization and 
diffusion of health technologies at various levels (e.g., patient, 
health care provider or institution, regional, national, and international 
level) [2]. 

HTA information may be particularly useful in supporting 
decisions when: 

• A technology has high unit or aggregate costs. 
• Explicit trade-off decisions must be made in allocating resources among technologies. 
• A technology is highly complex, precedent-setting, or involves significant uncertainty. 
• A proposed provision of a treatment, diagnostic test, or piece of medical equipment is innovative or controversial. 
• An established technology is associated with significant variations in utilization and outcomes [4]. 

HTA is an important analytical tool that is used in an assortment 
of settings by people with diverse backgrounds, perspectives, 
and goals. An assessment could be conducted in order to differentiate 
technology value (e.g., a new drug seeking reimbursement on 
a formulary), or it could be an academic study of the health consequences of a 
particular health care practice, such as a randomized trial or a systematic 
review of any or all aspects of a particular health care practice 
carried out by an HTA agency. However, there is little 
uniformity across Canada and elsewhere concerning HTA methodology 
and application. No doubt this diversity strengthens the 
results, but it also makes generalization difficult [3]. 

Technology evaluation is not a single process but varies, 
depending on what is being evaluated, by whom, and for what purpose. 
Generally, the evaluation of a new drug proceeds along different 
lines than that of a medical device or diagnostic test. Furthermore, 
questions involving technology assessment can be local (e.g., a hospital 
evaluating the need for a new PET scanner), national (e.g., 
evaluation of a new drug), or a combination of both. Although 
there is obviously some variation in how technology assessment is 
performed, some factors are common to the process [7]. 

To be effective, HTA should serve as a bridge between scientific 
evidence, the judgment of health professionals, the views of 
patients, and the needs of policy-makers. Much is at stake regarding 
how the results of HTA are interpreted and applied. 

In the following section, we provide a basic framework for 
conducting an HTA. Note that the assessment of health technologies 
is an iterative process and some of the steps may not occur 
linearly and some may even overlap. 


2 Basic Framework for Conducting an HTA 


2.1 Identifying the Topic for Assessment and Setting Priorities 


Determining what technologies to assess and setting priorities 
among them can be difficult given that there are more health 
technologies in need of evaluation than there are resources available 
for assessing them. In some instances, the assessment topics may 
already be determined. For example, the Common Drug Review at 
the Canadian Agency for Drugs and Technologies in Health 
(CADTH) conducts objective, rigorous reviews of the clinical, 
cost-effectiveness, and patient evidence for all new drugs. On the 
other hand, many new and existing medical devices and surgical 
procedures have never been assessed and remain unproven. 
Potential technology assessment topics can be identified through 
several sources, including informal surveys of advisory committees, 
health care stakeholders, horizon scanning, or requests from 
interested parties [8]. Procedures for setting priorities among these 
candidate technologies are variable. Some HTA agencies have 
devised processes that use explicit, highly systematic, and quantitative 
approaches that incorporate some form of priority-setting criteria 
(e.g., multicriteria decision analysis (MCDA)), often accompanied 
by a deliberative process for identifying priorities [2, 8, 9]. Other 
agencies use ad hoc processes. In many instances, the technologies 
that get assessed are those with high costs; that have a high potential 
to improve health outcomes or reduce health risks; that affect a 
large population base; that may be disruptive to the health care 
system; and for which there is an imminent need to make reimbursement 
decisions. Having a practical and transparent approach to selecting 
and prioritizing the most important and policy-relevant topics will 
help to efficiently allocate resources available for HTA research [8]. 


2.2 Clear Specification of the Assessment Problem 

Once the topic for assessment has been decided, it is imperative to 
clearly specify the problem(s) or question(s) to be addressed and 
the target audience for the assessment as these will affect every 
aspect of the HTA and the usefulness of the results [2]. Assessment 
problem statements need to consider the patient or population 
affected; the potential social and ethical issues relevant to the population, 
context, or society in general; the intervention being considered; 
what the intervention is being compared to; the relation of 
the new technology to existing technologies; what the outcome(s) 
of interest is/are; and the setting [10]. Additionally, those conducting 
the assessment should have an explicit understanding of 
the purpose of the assessment and the intended users of the assessment 
[2]. 

The intended users or target audiences of the assessment report 
will influence its content, presentation, and dissemination strategy. 
Health care professionals, researchers, government policy-makers, 
and others have different interests and levels of expertise [2]. The 
scientific or technical level of reports, the presentation of evidence 
and findings, and the format of reports vary by target audience. 


2.3 Evaluation of Social and Ethical Issues 

There are different ways of thinking about what a health technology 
does and what implications it has. In most instances, HTA 
focuses on whether the technology works for its intended purpose 
and whether it works better than other technologies. This focus 
on appropriateness does not usually consider the “side effects” of 
a technology, or what types of impacts it may have outside of its 
intended use. Each HTA must consider the social implications of 
the technology for stakeholder groups such as patients that use 
the technology, other patients in the system, health care providers, 
family members, payers, producers/industry, or society [11]. 
At the same time, a health technology may have some serious 
ethical issues associated with its use and dissemination which must 
be identified and evaluated. There have been a number of 
approaches proposed to assess both the social and ethical dilemmas 
within an HTA. 


2.3.1 Approaches to Identifying Social Impacts 

A social approach to HTA is concerned with understanding
the impacts that a technology has beyond what it is intended to do.
Some technologies might have more serious social impacts than
others. A social approach to HTA is interested in examining
everything a technology might do or affect and integrating those issues
into the assessment of the technology’s value. This type of approach
considers the (positive, negative, unexpected) impacts a technology
may have in all spheres of social life, at both the micro and
macro level. For instance, the impacts on a person’s individual role
in their family, community, or job should be considered as well as
the impacts to broader groups of society (e.g., particular social
groups or society in general) in terms of culture, norms, and values
[11]. Technologies with many far-reaching impacts have been
called “morally challenging” [12] because they pose moral issues
which are broader than the specific technology and people who
come into direct contact with that technology (e.g., in vitro
fertilization). Other technologies will have fewer ethical or social
implications or they may impact smaller groups of people.

Unlike clinical and economic assessments, which use empirical
evidence to correctly explain and predict outcomes of a technology,
the first step in assessing social issues is to identify what issues
should be considered. There are several ways to approach the
identification of issues, including engagement with citizen and patient
organizations, primary research, virtual forums, expert consultation,
and the synthesis of published qualitative literature [11].

Qualitative research uses interviews, focus groups, observation,
and many other approaches to examine the opinions, beliefs,
and experiences of different groups of people. Examining published
qualitative research about the technology or class of technologies
in an HTA can be very helpful in identifying potential
issues that should be considered in any recommendations. Looking
to see what users, providers, or the public have said about that
technology (or class of technologies) can corroborate concerns
that the analysts have already identified as well as reveal unexpected
issues. It can help to identify and characterize potential problems
related to that technology. Qualitative research can also suggest
values, goals, and outcomes that matter to patients and could be
incorporated into other aspects of the HTA, for instance, with
regard to the outcome measures chosen for comparison. Examining
existing literature on the technology could be considered due
diligence in order to understand the concerns and perspectives of
patients, providers, and the public. It can give access to perspectives
that may not be easily available to HTA decision-makers, such as
the perspectives of marginalized or vulnerable patients who may
not be able to easily participate in public engagement exercises.

After identifying potential social impacts of a technology, the
next step is to examine who will be affected by those impacts and
how they will be affected. This assessment requires an understanding
of the values, stakeholders, and domains that may be affected.
Values are ideas about what “ought to be” instead of statements
about what “is.” They may include formal ethical values but may
also include broader ideas of what an organization or society thinks
is good, what ideals they are committed to upholding, etc. Some
agencies have produced lists of core values that they use to guide
their consideration of social and ethical issues [13-15]. Values may
also be identified from the literature, stakeholders, or experts.
These may include explicit values such as autonomy, dignity, equity,
patient-centered care, or resource stewardship. HTA is also
informed by implicit values such as effectiveness, efficacy, scientific
evidence, etc.

2.3.2 Ethical Analysis in HTA

Traditionally ethical analysis is performed through normative
reflection on ethical questions around the technology of interest,
based on ethical principles and theories. In this approach, the
reflection can take place along different philosophical perspectives.
For example, utilitarianism promotes maximization of benefits for
the greatest number of people; deontological ethics focuses on
duties, rules, and obligations; while virtue ethics emphasizes moral
character and virtues of individuals [16]. When a principle-based
method is used, the ethical reflection is generally directed at the
question of whether the consequences resulting from implementation
of a specific technology can be justified by the four prima facie
bioethical principles of respect for autonomy, beneficence,
nonmaleficence, and justice proposed by Beauchamp and Childress [17].

More recently proposed methods promote the use of
participatory and interactive approaches in addition to ethical reflection.
Participatory models involve diverse stakeholders and citizens in 
the processes of assessment in order to learn about their personal 
and societal value positions and obtain their concerns about the 
technology and its alternatives [18]. 

Prior to the utilization of an ethical assessment method, it is
important to consider its potential limitations. For example,
normative approaches require an adequate knowledge of ethics and
ethical theories, which may not be available within most HTA
organizations. In addition, they can be affected by the ethicists’
own prereflective values [19]. Participatory methods, on the other
hand, are usually costly, time consuming, and complex to perform.
Other challenges that HTA developers might face when attempting
to address ethical issues of a health care technology include
lack of consensus on a practical method of considering ethical
issues in HTA [20, 21], complexity of the collection and processing
of qualitative data, the institutional barriers related to the attitudes
of researchers, and the availability of required resources. The choice
of a method for collection and analysis of ethical data should be
based on the context in which technology is being assessed, the
purpose of analysis, and availability of required resources [18].

After an examination of relevant issues in light of identified
social and ethical values, the next steps vary greatly, depending on
the purpose of the HTA, the commissioning agency, and the issues
identified. For example, the evaluation of both the ethical and
social issues may help to inform the protocol for gathering
evidence, inform other types of assessments, suggest particular
recommendations about funding or implementation of the
technology, or be used during deliberation about the
evidence presented. As a result, it is recommended that an analysis
of the social and ethical issues be performed throughout the HTA
process [22].

In summary, social and ethical issues are an important aspect of
the HTA process; however, methods for assessing ethical and social
implications of health care technologies are variable and still being
developed, and the means of translating these implications into
policy are often unclear [19].

2.4 Sources of Research Evidence for HTA

One of the great challenges in HTA is to assemble all of the
evidence relevant to a particular technology before conducting a
qualitative or quantitative synthesis. Although some sources are 
devoted exclusively to health care topics, others cover the sciences 
more broadly. Multiple sources should be searched to increase the 
likelihood of retrieving all relevant reports [23]. 

A comprehensive search of the literature is a key step in any
HTA that relies on the retrieval and synthesis of primary literature
as the evidence base. Performing a comprehensive search following
accepted practices will help to avoid missing relevant studies; avoid
other potential biases such as publication, time lag, and language
bias [24-26]; and assist the researcher in the provision of detailed
search documentation, aiding transparency and increasing
confidence in the assessment. As the literature search is the foundation
of most HTAs, it is a reasonable supposition that a poor search
would lead to a poor assessment. HTA search methods have, for
the most part, been developed based on well-documented search
methods employed in systematic review, modified in consideration
of the general HTA audience and context [27]. Given the
complexity of the methodology, nonprofessional searchers involved in
conducting an HTA are advised to seek assistance from a
librarian who is experienced in performing HTA or systematic review
searching [28].

Table 1
Key and additional databases necessary for HTA

Key databases

PubMed [free, available: http://www.ncbi.nlm.nih.gov/entrez]

Medline [$, available from various vendors]

EMBASE [$, available from various vendors]

The Cochrane Library (includes some/all of the following, depending on vendor: Cochrane
Database of Systematic Reviews (CDSR); Database of Abstracts of Reviews of Effects (DARE);
Cochrane Central Register of Controlled Trials (CENTRAL); Cochrane Database of
Methodology Reviews (CDMR); Cochrane Methodology Register (CMR); Health Technology
Assessment Database (HTA); NHS Economic Evaluation Database (NHS EED)) [$ for full text
in most of Canada, available from various vendors]

Centre for Reviews and Dissemination, University of York (UK) databases (includes: DARE
(Database of Abstracts of Reviews of Effects); NHS EED (Economic Evaluation Database);
Health Technology Assessment (HTA) Database; Ongoing Reviews Database) [free, available at:
http://www.crd.york.ac.uk/CRDWeb/]

Additional databases that should also be considered based on scope, content, and availability

Cumulative Index to Nursing & Allied Health (CINAHL) [$, available from various vendors]

BIOSIS Previews [$, available from various vendors]

EconLit [$, available from various vendors]

Educational Resources Information Center (ERIC) [free, available at: http://www.eric.ed.gov/;
$, available from various vendors]

Health and Psychosocial Instruments (HAPI) [$, available from various vendors]

Health Economic Evaluations Database (HEED) [$, available from John Wiley & Sons, Ltd.]

PsycINFO [$, available from various vendors]


2.4.1 Types of Literature

The two main types of literature and information resources relevant
to HTA are published and grey (or fugitive) literature. Bibliographic
databases, being the primary source of published literature, are
described by the U.S. National Library of Medicine (NLM) as
“extensive collections, reputedly complete, of references and
citations to books, articles, publications, etc., generally on a single
subject or specialized subject area” [29]. The number of databases
searched depends on the time, funds, and expertise available and is
typically topic-dependent [30]. However, relying on one database
exclusively is generally not considered adequate [28]. Table 1
provides key and additional databases necessary for any reasonably
comprehensive HTA. When considering which databases to search,
consideration should also be given to the availability of special
database functions, which might be of use (e.g., automatic updates).

Grey literature consists of “reports that are unpublished, have
limited distribution, and are not included in bibliographic retrieval
systems” [31] and is usually not easily available. Examples of grey
literature include, but are not limited to: book chapters; census,
economic, and other data sources; conference proceedings and
abstracts; government and technical reports; newsletters; personal
correspondence; policy documents; and theses and dissertations.
Results from grey literature can make a significant contribution to
an assessment, and it has been found that “the exclusion of grey
literature from meta-analyses can lead to exaggerated estimates of
intervention effectiveness” [32].

Sources of grey literature are abundant, and as with published
literature, the number of sources searched depends on resources,
topic, and a particular project’s or researcher’s needs [30]. Certain
sources and tools may, however, be considered “core” for any
HTA; these include library catalogues; search engines; and
websites of HTA organizations, clinical trial registers, professional
organizations, and regulatory agencies. A collaboratively produced
collection of freely available and up-to-date resources available on
the Internet includes many of these core sources and should not be
missed [33]. An additional practical grey literature searching
resource is produced and regularly updated by the Canadian
Agency for Drugs and Technologies in Health (CADTH) [34].

Naturally, there is some crossover between published and grey
literature. For example, theses and dissertations are sometimes
considered grey literature, as one of their primary routes of access
is through library catalogues. However, theses and dissertations
from some institutions and countries are also available via
bibliographic databases such as ProQuest Dissertations and Theses or the
Theses Canada Portal [35].

2.4.2 Designing a Search Strategy

Designing a search strategy for the purposes of HTA is both a
science and an art. The goal is to develop a strategy that strikes an
optimal balance of recall versus precision in order to retrieve as
many relevant records as possible, without having to sort through
an unmanageable number of those which are irrelevant. Recall is
the ratio of the number of relevant records retrieved to the total
number of relevant records in the database, while precision is the
ratio of the number of relevant records retrieved to the total
number of irrelevant and relevant records retrieved; the two are inversely
related (Fig. 1) [36]. 
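
The two ratios defined above can be computed directly once a search result is compared against a set of records known to be relevant. The following is a minimal sketch; the function name and the record identifiers are hypothetical, and in practice the full set of relevant records in a database is unknown and is usually approximated by a reference set of known relevant articles.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a single search.

    retrieved: record IDs returned by the search strategy.
    relevant:  record IDs of all relevant records in the database
               (an assumed reference set; rarely known exactly).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant records that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: four relevant records exist in the database.
relevant = {1, 2, 3, 4}
broad_search = set(range(1, 11))  # 10 records retrieved, all 4 relevant ones found
narrow_search = {1, 2, 11}        # 3 records retrieved, 2 of them relevant

broad = recall_and_precision(broad_search, relevant)
narrow = recall_and_precision(narrow_search, relevant)
```

In this invented example the broad search finds every relevant record at the price of many irrelevant ones, while the narrow search is more precise but misses half of the relevant records, which illustrates the trade-off described in the text.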

As stated in Subheading 2.2, the first step in designing an optimum
search strategy is to clearly formulate the research question.
Identifying key topics and translating them into a clearly focused 
question using the PICO(S) model (adapted from PICO [37]) is 
useful, where PICO(S) stands for: 

• Patient or population. 

• Intervention. 

• Comparator. 

• Outcome. 

• (S)tudy type. 






Using this model will not only help in the development of an
appropriate search strategy but also help define the study’s
inclusion and exclusion criteria, and facilitate data extraction from
primary studies.

The PICO(S) components can then be “translated” into
language appropriate for the resources to be searched. In the case of
bibliographic databases, the appropriateness of the language is
dependent upon which database and interface are being searched
(e.g., PubMed via NLM vs. Embase via OVID). Each database has
its own language, or controlled vocabulary, and each interface has 
its own syntax, or naming system. One should consider how to 
best combine search terms to make the search most effective (e.g., 
Boolean operators: OR, AND, NOT). 
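
As an illustration of how PICO(S) concepts might be combined with Boolean operators, the sketch below joins synonyms for one concept with OR and joins the concepts with AND. The function name, the example terms, and the generic query syntax are assumptions made for illustration; real interfaces add their own field tags, truncation symbols, and controlled vocabulary, so the output would need to be adapted to each database.

```python
def build_query(concepts):
    """Combine search terms into one Boolean query string.

    concepts: dict mapping a PICO(S) element to a list of synonyms.
    Synonyms are joined with OR; the resulting groups are joined with AND.
    """
    groups = []
    for terms in concepts.values():
        if not terms:
            continue  # skip PICO(S) elements that were left empty
        # Quote multi-word terms so they are searched as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical research question expressed as PICO(S) terms.
pico = {
    "population": ["type 2 diabetes", "T2DM"],
    "intervention": ["insulin pump", "CSII"],
    "comparator": [],
    "outcome": ["HbA1c", "hypoglycemia"],
}
query = build_query(pico)
```

Keeping the terms in a structure like this, rather than in a hand-typed string, also makes it easier to document exactly which terms were searched in each database.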

It is helpful to analyze “seed” documents or articles (previously
identified articles which are closely matched representations
of the items one wishes to retrieve), if available, to identify search 
terms of interest. Once developed, the draft search strategy should 
then be tested and refined as required. 

Before running a final search, one should also consider how 
the results will be managed, as this will affect the record retrieval
format. Use of bibliographic management software such as
Reference Manager® or RefWorks® is recommended, along with 
accurate records of databases searched and how many references 
have been identified through various search methods so that an 
accurate search result diagram can be completed, according to 
PRISMA methods [38]. 
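
The record keeping described above can be sketched as a simple tally of records per source, before and after removal of duplicates. This is only an assumed illustration: the function, field names, and identifiers are hypothetical, and bibliographic management software typically matches duplicates on title, authors, and year rather than on a single identifier.

```python
from collections import Counter


def tally_records(records):
    """Summarize search results for a flow diagram.

    records: iterable of (source, identifier) tuples, one per retrieved record.
    Returns counts per source, the total identified, and the number left
    after duplicates (same identifier, ignoring case) are removed.
    """
    records = list(records)
    per_source = Counter(source for source, _ in records)
    unique_ids = {identifier.strip().lower() for _, identifier in records}
    return {
        "per_source": dict(per_source),
        "identified": len(records),
        "after_duplicates_removed": len(unique_ids),
    }


# Hypothetical search results from three sources.
records = [
    ("PubMed", "10.1000/a1"),
    ("PubMed", "10.1000/b2"),
    ("EMBASE", "10.1000/A1"),          # same article as the first PubMed record
    ("Grey literature", "report-007"),
]
summary = tally_records(records)
```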
2.5 Assessing the Quality of the Evidence

Evidence interpretation involves classifying the studies, grading the
evidence, and determining which studies will be included in the
synthesis. Assessors should use a systematic approach to critically
appraise the quality of the available studies. Interpreting evidence
requires knowledge of investigative methods and statistics [4].

The initial step in interpreting evidence is the establishment of
criteria for its inclusion and role in the review; not all data available
on a given technology may be suitable or equally useful for the
purposes of a specific assessment [39]. The methods and
presuppositions of both peer-reviewed publications and other material
should be scrutinized on several grounds. Using formal criteria of
methodological rigor and clarity is essential for grading the
data and their applicability to the assessment (e.g., Users’ Guides
to the Medical Literature [39, 40]).

A key characteristic of a systematic review is an assessment of
the validity of the findings of the included studies in order to
minimize bias and provide more reliable findings from which
conclusions can be drawn and decisions made [41-43].

2.6 Synthesize and Consolidate Evidence

For many topics in technology assessment, a definitive study that
indicates one technology is better than another does not exist. 
Even where definitive studies do exist, findings from a number of 
studies often must be combined, synthesized, or considered in 
broader social and economic contexts in order to respond to the 
particular assessment question(s) [2]. 

Data synthesis may be narrative, such as a structured
summary and discussion of the studies’ characteristics and findings,
or quantitative, involving statistical analysis. The statistical
combination of results that is most frequently used in HTA is
meta-analysis. Combining studies increases the sample size
and therefore yields more precise estimates of the treatment effects
[43]. The Cochrane Collaboration has developed a software
program, Review Manager (RevMan), that can perform a variety of
meta-analyses, but it must be stressed that meta-analysis is not
appropriate in all systematic reviews, for example if the outcomes
measured are too diverse [43]. 
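
To show why combining studies yields a more precise estimate, the sketch below pools study results with inverse-variance weights under a fixed-effect assumption. This is one common approach chosen here for illustration, not a description of how RevMan is implemented; the study estimates and standard errors are invented, and a random-effects model may be more appropriate when studies are heterogeneous.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance fixed-effect pooling of study effect estimates.

    Each study is weighted by 1 / SE^2, so more precise studies count more.
    Returns the pooled estimate, its standard error, and a 95% interval.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    interval = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, interval


# Hypothetical mean differences and standard errors from three studies.
estimates = [-0.50, -0.30, -0.40]
std_errors = [0.20, 0.10, 0.20]
pooled, pooled_se, interval = pool_fixed_effect(estimates, std_errors)
```

With these invented inputs the pooled standard error is smaller than that of any single study, which is the gain in precision the text refers to.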

The synthesis of existing data is the most efficient HTA method 
if high-quality data are available. However, a problem arises when 
adequate evidence is limited, conflicting, or too uncertain in any or 
all of the relevant areas (e.g., efficacy, effectiveness, costs, quality of 
life). This highlights the problem the decision-maker often faces, 
confronted with the need to make reimbursement decisions and 
not knowing with certainty whether new health technologies are 
effective, safe, and cost-effective compared to existing technologies. 
Uncertainty creates problems for decision-makers because they are 
charged with choosing between various scenarios when there is 
insufficient definitive information on which to base decisions [44].
In health care, the stakes for such decisions are high and may carry
both high financial and health risks and rewards. For example, the
decision-maker could potentially recommend an unfavorable
technology or reject a favorable technology, effectively denying access
to a potentially beneficial health technology. Furthermore, the
overarching imperative and responsibility for decision-makers is to
make decisions, even if on poor quality evidence. In these
situations, decision-makers have three options: not implement the
technology, fully implement the technology, or conditionally
implement the technology [39, 44]. If it is thought that a health
technology holds great promise, the decision-maker may also
recommend primary data collection, or a “field evaluation,” to
reduce the informational uncertainty about the technology, thus
yielding optimum site-specific recommendations [10, 45, 46].

2.7 Collection of Primary Data (as Appropriate)

If it has been determined that existing evidence will not adequately
address the assessment question(s) and that primary data needs to
be collected, there are a variety of methods that assessors can use
to generate new data on the effects, costs, or patient outcomes of
health care technology:

• Experimental or randomized controlled trials (RCTs).

• Nonrandomized trial with contemporaneous controls.

• Nonrandomized trial with historical controls.

• Cohort study.

• Case-control study.

• Cross-sectional study.

• Surveillance (e.g., registries, or surveys).

• Case series.

• Single case report.

These methods are listed in rough order of most to least
scientifically rigorous for internal validity (i.e., for accurately representing
the causal relationship between an intervention and an outcome).
Methods for collecting primary data to answer policy questions are
continuing to evolve. While rigorous randomized controlled trials
are necessary for advancing research or for achieving market access
by regulatory bodies, they do not necessarily address the needs of
health policy-makers. There is a growing trend
towards “pragmatic” clinical trials that are intended to meet these
needs [47]. Pragmatic trials select clinically relevant alternative
interventions to compare; recruit participants from heterogeneous
practice settings; and collect data on a broad range of health
outcomes (e.g., health-related quality of life). Pragmatic trials require
that decision-makers become more involved in priority-setting,
design, funding, study implementation, etc. [2, 47].

The careful collection of primary data on new technologies
raises unique issues since it can be as logistically complex, time
consuming, and expensive as the rest of the assessment
combined. The evaluation of technologies early in their life cycle may
not be undertaken when the potential return on investment is
unknown. Original evaluation consumes resources to such an
extent that priority may be given to the assessment of technologies
for which substantial data already exist [39]. This creates an ironic
cycle in which technologies remain unevaluated until they are
widely accepted into practice, at the risk of harmful consequences
for patients and financial consequences for institutions and society
[39]. As well, there is considerable debate about what kinds of
study designs are “good enough” for addressing important HTA
questions of effectiveness [48].

Overall, primary data collection for HTA is likely to be most
beneficial when idealized efficacy has already been demonstrated in
the controlled, but artificial, environment of a randomized clinical
trial, and data on effectiveness, feasibility, and cost are needed from
a relevant real-world setting [45, 46]. The front-end investment
for field evaluations could potentially offset inappropriate larger
investments downstream [49].

2.8 Economic Analysis in HTA

Once the benefits and risks are determined from the results of the
systematic literature review and, in some cases, the field evaluation,
decision-makers are often interested in determining whether the
benefits of an intervention will be worth the health care resources
consumed. In other words, does the technology represent good
value for money? In these instances, an economic evaluation will be
conducted to measure the incremental costs and benefits of the
technology under review compared to one or more other
technologies. An economic analysis is a set of formal, quantitative
methods used to compare alternative treatments, programs, or
strategies in terms of both their costs and consequences [50-52].
Therefore the basic tasks of any economic evaluation are to
identify, measure, value, and compare the costs and effects of the
alternatives being considered [51]. The overall role of the economic
analysis in an HTA is to provide information about the necessary
resource consumption from the use of health technologies
compared with the health outcome obtained. It is important to
remember that economic evaluations seek to inform resource
allocation decisions, rather than to make them.

The identification of various types of costs and their subsequent 
measurement in monetary units is similar across most economic 
evaluations; however, the nature of the consequences stemming 
from the alternatives being examined may differ considerably [51].
There are three types of economic analyses that can be relevant to
consider as part of an HTA: cost-effectiveness analysis, cost-utility
analysis, and cost-benefit analysis.

2.8.1 Cost-Effectiveness Analysis (CEA)

In cost-effectiveness analysis, it is necessary to identify,
measure, and value both the costs and consequences of the compared
health technologies. In this type of analysis, the consequences are
measured as a single measure or dimension of effect expressed in
natural units (e.g., death, life years gained). CEA can be performed on
any alternatives that have a common effect (e.g., cost per mmHg
drop in diastolic blood pressure obtained). CEA is somewhat
limited in decision-making since it cannot be used to compare
treatment strategies with different outcomes. From this analysis, it is
only possible to conclude which of the alternative technologies is
cost-effective in relation to a specified goal [51].

2.8.2 Cost-Benefit Analysis (CBA)

In cost-benefit analysis, the broadest type of economic analysis,
consequences are valued in monetary units. The monetary value of an
outcome is then compared to the cost of the intervention or its
implementation [53]. For example, benefits could include averted
medical costs or productivity losses associated with an early
diagnosis. Asking about a person’s willingness-to-pay for a specific
outcome and treating the response as an expression of the preferences
for, and the value of, the treatment is another way to value benefits.
If the costs of the program are less than the benefits (e.g., costs
averted) or societal values (e.g., willingness to pay), the program
would be recommended. The clear advantage of this analysis is that
the cost and consequences are now both measured in monetary
units, from which the net benefit can immediately be calculated
and whether the technology is worthwhile can be determined [51,
54]. Due to the methodological challenges associated with CBAs
(i.e., placing a monetary value on an outcome), this type of analysis
is rarely used [53].

2.8.3 Cost-Utility Analysis (CUA)

Finally, cost-utility analysis uses health-related quality of life as the
measure of treatment effect. The term utility is used here to refer to
the preferences individuals or society may have for any particular set
of health outcomes (e.g., for a given health state). Utility analysis
allows for quality of life adjustments to a given set of treatment
outcomes, while simultaneously providing a generic outcome measure
for comparison of costs and outcomes of different technologies.
The generic outcome, usually expressed as quality-adjusted life years
(QALYs), is arrived at in each case by adjusting the length of time
spent in a particular health state by the quality of life [55]. Utilities,
or preferences, for health states, act as qualitative weights to
combine the quantity and quality of life where a utility is measured on a
scale from 0 to 1, with 0 representing death and 1 being perfect
health. Utility values can be estimated using either direct or indirect
methods. Direct measurements of utilities involve using various
techniques to elicit a person’s preferences for various health states.
Some examples of direct elicitation techniques are time trade-off and
standard gamble. Because these methods are resource and time
intensive, indirect measures of utility values, such as the European
EuroQoL-5D or the Canadian Health Utilities Index
self-administered questionnaires, are
more often used. 
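
The adjustment of time spent in a health state by its utility can be written as a short calculation. The sketch below is illustrative only: the function name and the health-state profiles are invented, and discounting of future years is ignored for simplicity.

```python
def qalys(health_states):
    """Quality-adjusted life years for a sequence of health states.

    health_states: list of (years_in_state, utility) pairs, where utility
    lies on the 0 (death) to 1 (perfect health) scale described in the text.
    """
    total = 0.0
    for years, utility in health_states:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utility must lie between 0 and 1")
        total += years * utility
    return total


# Hypothetical profiles for one patient under two alternatives.
new_technology = [(2, 0.9), (3, 0.7)]  # 2 years at 0.9, then 3 years at 0.7
current_practice = [(4, 0.6)]          # 4 years at 0.6

qaly_gain = qalys(new_technology) - qalys(current_practice)
```

Here the new technology yields about 3.9 QALYs against 2.4 for current practice, a gain of roughly 1.5 QALYs that reflects both longer survival and better quality of life.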

CUA is important in instances when the quality of any life 
years gained is important (e.g., chronic diseases) or when there is a 
desire for an overall measure of effectiveness enabling comparisons 
to be made across the health care sector [51, 54].

Factors like the disease area, the alternative technologies, the 
measurement and valuation of the consequences, as well as the 
purpose of the economic analysis are important in deciding which 
type of economic analysis should be chosen in any given case [54]. 

To be able to perform an economic analysis of the technology
in question, there has to be at least one relevant alternative
technology with which it may be compared. The cost-benefit analysis
can, however, be conducted for only one technology. An economic
analysis aims to answer the questions regarding whether a new
health technology is cost-effective compared to current practice,
which it is supposed to replace, and whether the technology is
cost-effective in general compared to other optimally cost-effective
technologies. To be relevant to decision-making, the chosen
alternative for the economic analysis should at least represent the
current health technology or practice which the new health technology
is expected to replace [56]. 
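
The comparison with current practice is commonly summarized as an incremental cost-effectiveness ratio (ICER), a standard term that the text above does not use by name, whereas a cost-benefit analysis of a single technology reduces to a net benefit in monetary units. The sketch below assumes invented costs and effects and hypothetical function names.

```python
def icer(cost_new, effect_new, cost_old, effect_old):
    """Incremental cost-effectiveness ratio of a new technology versus a
    comparator: the extra cost per extra unit of effect (e.g., per QALY)."""
    delta_cost = cost_new - cost_old
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("no difference in effect; the ratio is undefined")
    return delta_cost / delta_effect


def net_benefit(monetary_benefits, costs):
    """Cost-benefit analysis: benefits and costs both in monetary units."""
    return monetary_benefits - costs


# Hypothetical figures: the new technology costs 12,000 and yields 4.0 QALYs;
# current practice costs 6,000 and yields 2.5 QALYs.
ratio = icer(cost_new=12000, effect_new=4.0, cost_old=6000, effect_old=2.5)

# Hypothetical single-technology cost-benefit comparison.
benefit = net_benefit(monetary_benefits=15000, costs=12000)
```

Whether a ratio of this size represents good value depends on what decision-makers are willing to pay per unit of effect, which is a policy judgment and not something the calculation itself can supply.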

In some situations it is necessary to model the economic
analysis. Extrapolation of short-term clinical data with the purpose of
predicting these data in the long run or the extrapolation of
intermediate measures of effectiveness to final measures of
effectiveness (e.g., high cholesterol as a risk factor for myocardial infarction)
might be reasons for the use of modeling in economic analysis.
Additionally, economic and clinical data can be missing, especially
in the early development phase of a health technology. In such a
situation the economic analysis may be entirely modeled and
based on the best evidence available. Decision trees and Markov
models are two of the most frequently used types of modeling
approaches [54]. 
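
A Markov model of the kind mentioned above can be sketched as a cohort that moves between health states in fixed cycles while costs and QALYs accumulate. Everything in this example is assumed for illustration: the three states, the transition probabilities, the costs, and the utilities are invented, rewards are counted at the start of each cycle, and no half-cycle correction is applied.

```python
def run_markov(transition, start, costs, utilities, cycles, discount_rate=0.0):
    """Simulate a cohort through a Markov model with one-year cycles.

    transition: dict of dicts, transition[a][b] = probability of moving a -> b.
    start:      dict giving the initial share of the cohort in each state.
    Returns (total discounted cost, total discounted QALYs) per person.
    """
    states = list(start)
    dist = dict(start)
    total_cost = 0.0
    total_qalys = 0.0
    for cycle in range(cycles):
        factor = 1.0 / (1.0 + discount_rate) ** cycle  # discount future cycles
        total_cost += factor * sum(dist[s] * costs[s] for s in states)
        total_qalys += factor * sum(dist[s] * utilities[s] for s in states)
        # Move the cohort one cycle forward.
        dist = {t: sum(dist[s] * transition[s][t] for s in states) for t in states}
    return total_cost, total_qalys


# Hypothetical three-state model.
transition = {
    "well": {"well": 0.80, "sick": 0.15, "dead": 0.05},
    "sick": {"well": 0.00, "sick": 0.70, "dead": 0.30},
    "dead": {"well": 0.00, "sick": 0.00, "dead": 1.00},
}
start = {"well": 1.0, "sick": 0.0, "dead": 0.0}
costs = {"well": 500.0, "sick": 4000.0, "dead": 0.0}
utilities = {"well": 0.9, "sick": 0.5, "dead": 0.0}

cost, qalys_total = run_markov(transition, start, costs, utilities, cycles=2)
discounted_cost, _ = run_markov(
    transition, start, costs, utilities, cycles=2, discount_rate=0.05
)
```

Running the same model once per alternative gives the costs and effects needed for the comparison between technologies, and the discount rate parameter shows one way the differential timing of costs and consequences might be handled.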

The checklist for economic analyses, presented below, can be 
used as a list for what should be remembered in the conduct of an 
economic analysis as part of an HTA as well as to provide the reader 
with an impression of the quality of published economic analyses 
(Table 2).

Table 2
A checklist for economic analyses (adapted from Drummond et al. [51] and Poulsen [54])

1. Is there a well-defined question, including whose perspective to take during the analysis?

2. Are the relevant competing alternatives included and described in the analysis?

3. Is the effectiveness of the compared technologies documented with sources?

4. Are all relevant costs and consequences, corresponding to the perspective, identified?

5. Are costs and consequences of the technologies measured in appropriate units?

6. Are costs and consequences valued credibly?

7. Are differential timing of costs and consequences handled including discounting?

8. Are sensitivity analyses carried out to test for uncertainty in the economic analysis and to
investigate the robustness of this analysis and its conclusion?

9. Are the conclusions in the analysis presented as a ratio of costs and effects?

10. Are the conclusions valid and generalizable, and are all interested parties considered?


2.9 Formulation of Findings and Recommendations 


A project’s findings and recommendations are the central elements 
of interest for most readers. Findings are the results or conclusions 
of an assessment; recommendations are the suggestions or advice 
that follow from the findings and should be phrased in a format 
parallel to that of the statement of the original questions. Where 
conclusions cannot be reached from the evidence considered, some 
commentary is needed regarding why certain questions cannot be 
answered [39]. Health technology assessments should link explicitly 
the quality of the available evidence to the strength of their 
findings and recommendations as well as any limitations. Doing so 
facilitates an understanding of the rationale behind the assessment 
findings and recommendations. It also provides a more substantive 
basis on which to challenge the assessment as appropriate. Further, 
it helps assessment programs and decision-makers determine if a 
reassessment is needed as relevant evidence becomes available [2]. 


2.10 Dissemination of Findings and Recommendations 


One of the fundamental aspects of an HTA is to translate the scientific 
data and research results into information that is relevant to 
health care decision-makers through the dissemination of the findings 
[10]. The results should be available to others interested in 
the problem through the published literature and informal collegial 
communication [39]. Dissemination strategies depend upon 
the mission or purpose of the organization sponsoring the assessment. 
Dissemination should be planned at the outset of an assessment 
along with other assessment activities and should include a clear 
description of the target audience as well as appropriate mechanisms 
to reach them [2]. Agencies have been established that 
provide a forum for HTA agencies and users of HTA reports 
to exchange information, such as the International Network of 
Agencies for Health Technology Assessment (INAHTA) http://www.inahta.org/Home 
and Health Technology Assessment international (HTAi) http://www.htai.org/index.php?id=419. 

Technology assessments that appear inconclusive should not 
be withheld from distribution. If they are methodologically sound, 
such assessments may be both useful to policy-makers and clinicians 
and also serve as a point of departure for future research [39]. 

2.11 Monitoring Impact of Assessment Reports 

The impact of HTA reports is variable and inconsistently evaluated. 
Some HTA reports are translated directly into policies with clear 
and quantifiable impacts (e.g., acquisition or adoption of a new 
technology; reduction or discontinuation in the use of a technology; 
or change in third-party payment policy), while the findings 
of others go unheeded and are not readily adopted into general 
practice [2]. It is important to keep in mind that HTA results will 
be only one of the many inputs that determine policy decisions. 

Since considerable amounts of scarce resources are invested in 
HTA, monitoring the impact of an evaluation is essential to maximizing 
its intended effects and preventing the harmful repercussions 
of misinterpretation or misapplication [39, 57]. An assessment 
project should include a plan for the follow-up evaluation of its 
report [39]. Because technology assessment is an iterative process, 
new information or changes in the technology may require the 
reevaluation of the project’s original conclusions [39]. 


3 Case Study: Health Technology Assessment of Drug Eluting Stents Compared 
to Bare Metal Stents for Percutaneous Coronary Interventions in Ontario 

The following HTA was conducted by the Programs for 
Assessment of Technology in Health (PATH) Research Institute, 
St. Joseph’s Healthcare Hamilton in Hamilton, Ontario. This 
research group uses an iterative evidence-based framework for 
reducing uncertainty around a health technology to provide 
information back to Health Quality Ontario and ultimately to 
the Ontario Ministry of Health and Long-term Care (MOHLTC) 
to assist them in making more informed evidence-based health 
policy recommendations. PATH’s Reduction in Uncertainty 
through Field Evaluations (PRUFE) iterative evidence-based 
framework is presented in Fig. 2. 

Fig. 2 Programs for Assessment of Technology in Health’s (PATH’s) reduction of uncertainty through field evaluation 
(PRUFE) iterative evidence-based decision-making framework 


3.1 Topic identification 

Prior to 2003, bare metal coronary artery stents (BMS) were 
commonly being used as part of the percutaneous coronary intervention 
(PCI) procedure for patients with coronary artery disease 
(CAD). However, patients receiving this intervention still had a 
relatively high restenosis rate (i.e., 15-20 % within a year) [58-60]. 
The drug eluting stents (DES), licensed in Canada in 2002, held 
the promise of reducing these rates and thus the potential for 
increased diffusion throughout the health care system. 

3.2 The Assessment Problem 

Prior to the introduction of DES in Ontario, the MOHLTC 
conducted an internal review of the literature pertaining to 
DES. The review determined that the efficacy of DES compared to 
Bare Metal Stents (BMS) had been demonstrated in a limited 
number of published randomized controlled clinical trials, and 
thus there was uncertainty surrounding the efficacy and cost- 
effectiveness of DES in a “real-world” setting. These new stents 
were significantly more costly than the technology they were meant 
to replace, and the MOHLTC estimated that the budget for coronary 
artery stents (i.e., BMS and DES) would require an additional 
$7.5-$28.6 million per year [61]. Due to the informational uncertainty 
and the high budget impact, the MOHLTC requested that 
PATH conduct an HTA including a field evaluation and economic 
analysis to examine the utilization, effectiveness, and costs associated 
with the introduction of DES into the Ontario health care 
system in order to make a reimbursement decision. 
3.3 Sources of Research Evidence for HTA 

Due to the lack of Ontario-specific information pertaining to the 
real-world use of DES, a “field evaluation” and economic analysis 
were conducted by PATH during the time of introduction. In order 
to obtain Ontario-specific “real-world” data, the MOHLTC 
provided 12 Cardiac Care Centres with annual renewable funding 
for 1 year for DES conditional on data collection for the field evaluation 
(coverage with evidence development). The resulting prospective 
population-based registry of all consecutive percutaneous 
coronary intervention (PCI) procedures in the province of Ontario 
between 2003 and 2006 provided an unbiased estimate of the 
effectiveness of this technology in the local environment. These data 
allowed for the comparison of DES to BMS with respect to repeat 
revascularization rates (i.e., PCI and coronary artery bypass 
surgery) and all-cause mortality. Uptake and health care resource 
utilization data were also collected. 

Concurrently, PATH continued the systematic literature review 
initiated by the MOHLTC to identify any new evidence pertaining to 
efficacy of DES. The results of the systematic literature review were 
examined alongside the evidence obtained from the field evaluation. 

3.4 Interpretation, Synthesis, and Consolidation of Evidence 

Only RCTs comparing DES to BMS providing relevant clinical 
outcomes (i.e., revascularization rates, acute myocardial infarction, 
and mortality) were included in the systematic literature review. 
Study results were both qualitatively and quantitatively (i.e., meta-analysis) 
summarized. 

Appropriate statistical methods were employed to measure 
outcomes for an observational study design to control for potential 
differences in baseline characteristics (from field evaluation). A 
naive economic model was developed and populated with the 
results from the field evaluation and published literature where 
data were lacking. 

3.5 Findings 

• The interim results from the field evaluation indicated that 
DES reduced predicted revascularization rates in some but 
not all patient cohorts at one year compared to bare metal 
stents (BMS). 

• In non-post-myocardial infarction (MI) patients, DES appeared 
to be most effective in reducing the need for revascularization 
in patients with long or narrow lesions (“high-risk” patients). 
This benefit was magnified in patients with diabetes. 

• DES also appeared to be effective in post-MI patients. 
However, further data collection is required in order to confirm 
the benefit of DES by lesion type in this patient cohort. 

• DES compared to BMS did not appear to provide a reduction 
in revascularization rates in patients with short and wide 
lesions, in patients with and without diabetes. 
3.6 Dissemination 

PATH has presented the results of the field evaluation, economic 
evaluation, and systematic literature review to the Ontario Health 
Technology Advisory Committee (OHTAC) on a number of 
occasions. For example, preliminary results for the DES field evaluation 
were presented on two separate occasions in 2005. Based 
on these presentations, OHTAC requested more specific analyses 
of data in patient groups at higher risk of restenosis, in particular 
patients having diabetes, narrow lesions, or long lesions [62]. 
These presentations provided the Committee with the opportunity 
to make recommendations about further analysis. This iterative 
process ensures that the information provided by the 
researchers is in line with the needs of decision-makers in order to 
make reimbursement decisions. 

The final HTA report of the field evaluation of the DES was 
completed and is available through the PATH web site via the 
following link: http://www.path-hta.ca/Libraries/Reports/DESreportMay2007.sflb.ashx. 

The findings of the HTA of DES have also been presented 
to other provincial governments (e.g., Agence d’évaluation des 
technologies et des modes d’intervention en santé [AETMIS] 
in Quebec) and to key stakeholders (e.g., Cardiac Care Network 
(CCN) of Ontario with representatives from nursing, medical, 
government, and administrators). Additionally, the results have 
been presented at national and international peer-reviewed scientific 
conferences (i.e., Society for Medical Decision Making, 
Canadian Association for Population Therapeutics, and 
International Society for Pharmacoeconomics and Outcomes 
Research). 

3.7 Impact of Assessment Results 

In March of 2007, OHTAC made the following recommendations 
to the Deputy Minister of Health of Ontario with regard to DES 
for PCI interventions in Ontario: 

1. DES be offered to those patients considered for stent placement 
and who have: 

(a) Diabetes. 

(b) Long lesions (greater than 20 mm) and/or narrow lesions 
(less than or equal to 2.75 mm). 

2. That PATH continue to collect data on patients who received 
DES. 

3. That the current support for DES not be increased at this time. 

4. These recommendations be provided to hospitals and cardiologists 
as soon as possible. 



4 Discussion 


The main purpose of HTA is to consolidate the best available 
evidence on technologies so that the results can inform decision-making 
in clinical practice and health policy. 
When well-conceived and implemented, HTA can make an important 
contribution to the proper distribution of resources, to the 
selection of cost-effective interventions, and to greater efficiency 
and more effective services [63]. 

The proper timing of the assessment of a health technology 
warrants special attention, as it can be very complex. Assessments 
can be conducted at any stage in a technology’s life cycle to meet 
the needs of a variety of stakeholders (e.g., investors, regulators, 
payers) and each may need to subsequently reassess technologies [2]. 
However, a trade-off exists between decision-makers’ wish for early 
assessment prior to widespread diffusion of health technologies 
and the problem of reliability and certainty of the information 
available early in the life cycle, thus leading to potential errors in 
decision-making. At the same time, late assessment runs the risk 
that the technology has already been used widely and costs have 
been incurred. Difficulties in convincing providers to discard an 
intervention once it has been introduced into clinical practice illustrate 
this dilemma [64, 65]. This quandary in decision-making has 
been formulated by Martin Buxton as “it’s always too early to evaluate 
until, unfortunately, it’s suddenly too late!” [66]. 

Compounding this problem is the fact that the stages of a technology’s 
life cycle are often not clearly delineated, and technologies 
do not necessarily mature through them in a linear fashion. A 
technology may be established for certain applications and may be 
investigational for others. A technology once considered obsolete 
may return to established use for a better-defined or entirely different 
purpose. Technologies often undergo multiple incremental 
innovations after their acceptance into practice [2]. As a result, 
HTA must be viewed as an iterative process. It may be necessary to 
revisit a technology when competing technologies are developed, 
the technology itself evolves, or new information is introduced. 
Reassessment may require additional data collection and ongoing 
assessments may be enhanced with techniques that aggregate 
results of research [2]. 


5 Concluding Remarks 

This chapter provides a basic framework for conducting an HTA as 
well as some fundamental concepts and challenges of the dynamic 
field of health technology assessment with the ultimate goal of 
encouraging the uptake of efficient and effective health care 
technologies. 

There are no standard methods for conducting HTA and new 
methods are continuously evolving. In fact, a variety of methods 
are often used depending on the purpose of the assessment, the 
resources available, the context and setting, etc. In any event, the 
general trend in HTA is to call for and emphasize more rigorous 
methods [2]. There is little argument that RCTs are an accepted 
high standard for testing efficacy under ideal circumstances, but 
they may not be the best means to evaluate all interventions and 
technologies that decision-makers are considering [67]. 
Observational studies with analyses that account for potential bias 
offer an opportunity to capture data from community practice 
at lower cost than randomized trials. In some cases, the process of 
performing an effective HTA can also include collecting 
primary data through pragmatic trials or local 
research initiatives [45, 46]. As a result, it is important for 
decision- and policy-makers to have a basic understanding of the 
research methods and the ethical and sociocultural issues that may be 
taken into consideration in an HTA. 

The future of HTA is not easy to predict. One thing is clear, 
however: HTA is here to stay. The need to contain costs and to 
target the use of technologies in areas that represent the best value 
for money will mean that decision-makers need more high-quality 
information on technologies’ impacts [68]. Coverage decisions are 
already made more and more frequently based on HTA. Still, 
implementing HTA results into clinical practice remains a formidable 
challenge [3]. 

Whether HTA is beneficial—supporting timely access to 
needed technologies—or detrimental depends on three critical 
issues: when the assessment is performed; how it is performed; and 
how the findings are used. Heightened demand for technology 
assessment arising from private and public organizations’ quest for 
value in health care is pushing the field to evolve keener processes 
and assessment reports tailored for particular user groups [2]. 
Evidence-Based Decision-Making 4: Development 
and Limitations of Clinical Practice Guidelines 

Bruce Culleton 

Abstract 

Clinical practice guidelines are systematically developed statements to help practitioners and patients reach 
appropriate health care decisions. If developed properly, clinical practice guidelines assimilate and translate an 
abundance of evidence published on a daily basis into practice recommendations and, in doing so, reduce the 
use of unnecessary or harmful interventions, and facilitate the treatment of patients to achieve maximum 
benefit and minimum risk at an acceptable cost. Traditionally, clinical practice guidelines were consensus- 
based statements, often riddled with expert opinion. It is now recognized that clinical practice guidelines 
should be developed according to a transparent process involving principles of bias minimization and 
systematic evidence retrieval and review, with a focus on patient-relevant outcomes. The process for the 
development, implementation, and evaluation of clinical practice guidelines is reviewed in this chapter. 

Key words Clinical practice guidelines, Clinical practice recommendations, Critical appraisal, 
Guideline grading, Implementation, Evaluation 


1 Introduction 


Clinical practice guidelines (CPG) are systematically developed 
statements to help practitioners and patients reach appropriate 
health care decisions. Their purpose is “to make explicit recommendations 
with a definite intent to influence what clinicians do” [1]. 
If developed properly, CPG assimilate and translate the abundance 
of evidence published on a daily basis into practice recommendations 
and, in doing so, reduce the use of unnecessary or harmful 
interventions, and facilitate the treatment of patients to achieve 
maximum benefit and minimum risk at an acceptable cost. CPG are 
not meant to replace sound medical decision-making which takes 
into account critical elements relevant to patient care including 
patient preferences and clinician experience. 

Traditionally, CPG were consensus-based statements, often 
riddled with expert opinion. Frequently, expert opinion was 
associated with bias (often nonintentional) and consensus-based 
recommendations are currently viewed with a certain amount of 
skepticism. It is now recognized that CPG should be developed 
according to a transparent process involving principles of bias 
minimization and systematic evidence retrieval and review, with a 
focus on clinically relevant outcomes. The process for the development, 
implementation, and evaluation of CPG is reviewed in 
this chapter. 


2 Principles of Clinical Practice Guideline Development 


The National Health and Medical Research Council of Australia 
has published CPG development principles [2]. Briefly, these principles 
state: 

• Processes for developing and evaluating CPG should focus on 
outcomes. 

• CPG should be based on the best available evidence and graded 
according to the level, quality, relevance, and strength of 
evidence. 

• CPG development should be multidisciplinary and include 
consumers. 

• CPG should be flexible and adaptable to local conditions. They 
should include evidence for different target populations and 
take into account patient preferences. 

• CPG should be developed with resource constraints in mind. 

• Implementation plans should be developed along with CPG. 

• The implementation of CPG should be evaluated. 

• CPG should be revised regularly to account for new evidence. 

Details on adhering to these principles and other relevant issues in 
the development of CPG are discussed in the sections that follow. 

2.1 Determine the Guideline Topic, Scope, and Target Audience 

CPG are often developed by professional medical societies or associations 
in response to a perceived need, such as variation in the 
delivery of care by health care providers for the same medical problem. 
Ideally, a needs assessment is performed with the user of the 
guidelines in mind. When CPG topics are ultimately selected, there 
must be a clear purpose and a defined problem to be addressed. 
The audience is usually obvious but this should be stated in the 
process of CPG development. 


2.2 Convene a Guideline Chair and Committee to Oversee Guideline Development 


The committee should consist of members with clinical expertise 
relevant to the topic and representatives from all pertinent groups 
including consumers and other allied health professionals, as applicable. 
If CPG are to be relevant, those who are expected to use 
them should participate in their development. Although the committee’s 
precise composition will depend upon a number of factors, 
strong consideration should be given to members with an ability 
to critically appraise published articles and with experience on 
other CPG workgroups. Health economists, bioethicists, and representatives 
of regulatory authorities may be relevant to certain 
committees. 

2.3 Identify the Health Outcomes and the Appropriate Questions 

The outcomes and questions will differ depending upon the topic 
under consideration. Outcome considerations include patient-relevant 
hard outcomes (e.g., all-cause or cardiovascular mortality), 
other patient-relevant outcomes (e.g., quality of life, hospitalization 
rates), surrogate outcomes (e.g., LDL cholesterol lowering, blood 
pressure reduction), process-related outcomes (e.g., re-admission 
rates, relapse rates), and patient satisfaction outcomes, to name a 
few. In general, it is accepted that these outcomes differ and the 
choice of outcome depends upon the topic, scope, and target 
audience. Within the realm of CPG, emphasis should be placed on 
patient-relevant outcomes. 

2.4 Retrieve the Scientific Evidence 

Recommendations placed in CPG should be based on the best 
possible evidence. Therefore, if possible, a systematic literature 
review should be performed very early in the course of guideline 
generation. The method chosen for the systematic review can 
range from highly structured quantitative syntheses of the literature 
(e.g., meta-analyses) to a subjective overview of observational 
data. The committee must decide between rigor and pragmatism 
[3] taking into consideration the extra costs and time to perform 
formalized quantitative reviews. Regardless of choice, the methods 
used to review the literature must be clearly stated. 

2.5 Formulate the Guidelines and/or Recommendations 

2.5.1 Phrase the Statement 

Although guidelines can be presented as charts or flow diagrams, 
more often than not, guideline statements are presented as free-flowing 
text. The guideline or recommendation statement should 
be as clear and concise as possible. Abbreviations should be avoided 
in the statement itself and statements should be phrased to initiate 
an action using terms such as “should” or “consider.” Negatively 
phrased statements are frequently avoided. 

2.5.2 Grade the Evidence 

The grading of evidence is an integral part of any CPG development 
process. Although there is considerable variation in the grading 
systems used across and within specialty groups, there are several 
core factors necessary for the success of grading systems. First, the 
CPG Workgroup members should have experience in the relevant 
clinical area and have expertise with critical appraisal. Second, the 
grading system must be structured, explicit, and above all else, 
transparent. It should be obvious to the reader how a grade was 
applied to the evidence. The majority of structured grading 
schemes take into account study design, methodological quality, 
and the population studied. Consistency of effects across studies 
is also an important consideration within grading schemes. 
Third, grading schemes should explicitly consider the balance 
between benefit and harm. Finally, the grading scheme should incorporate 
and define clinically important patient-relevant outcomes. 

Due to the considerable variation among grading systems, 
several groups have promoted the use and implementation of standardized 
grading schemes. It is argued that a standardized system 
would facilitate comparison of recommendations between groups. 
A standardized scheme would also lessen the likelihood 
of several societies producing different evidence grades when using 
the same evidence base, thereby reducing confusion and promoting 
effective communication. 

At the present time, the GRADE (Grades of Recommendation, 
Assessment, Development and Evaluation) Working Group appears 
to have momentum for harmonizing grading schemes. The 
GRADE Working Group is composed of international experts in 
the field of evidence-based medicine and guideline development. 
The GRADE approach has the benefits of providing a structured 
approach to grading the quality of evidence for questions regarding 
interventions and it explicitly identifies how grades are derived 
and where judgment is involved. It has also been adopted by other 
organizations [4-6]. The grading of evidence by this group takes 
into account four key elements: study design, study quality, consistency, 
and directness. A summary of the criteria used by the 
GRADE working group is shown in Table 1 [7]. 
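
The additive logic summarized in Table 1 can be sketched in a few lines of Python. This is an illustrative simplification and not an official GRADE tool: the function name, the dictionary keys, and the validation limits are assumptions made for the example, and real grading relies on judgment that a score cannot capture.

```python
# Illustrative sketch of the Table 1 scoring logic; all names are hypothetical.

LEVELS = ["very low", "low", "moderate", "high"]

# Entry level by type of evidence.
START = {
    "randomized trial": "high",
    "observational study": "low",
    "other": "very low",
}

# Largest number of levels each criterion can remove or add.
MAX_DOWNGRADE = {
    "study quality": 2,
    "inconsistency": 1,
    "directness": 2,
    "imprecise or sparse data": 1,
    "reporting bias": 1,
}
MAX_UPGRADE = {
    "strength of association": 2,
    "dose-response gradient": 1,
    "confounders would reduce effect": 1,
}


def _total(steps, limits):
    """Sum the requested steps after checking each against its limit."""
    total = 0
    for criterion, amount in steps.items():
        if criterion not in limits:
            raise KeyError(f"unknown criterion: {criterion}")
        if not 0 <= amount <= limits[criterion]:
            raise ValueError(f"{criterion}: expected 0 to {limits[criterion]}")
        total += amount
    return total


def grade_quality(evidence_type, downgrades=None, upgrades=None):
    """Return the quality level after applying downgrades and upgrades."""
    score = LEVELS.index(START[evidence_type])
    score -= _total(downgrades or {}, MAX_DOWNGRADE)
    score += _total(upgrades or {}, MAX_UPGRADE)
    # The final grade cannot go above "high" or below "very low".
    score = max(0, min(score, len(LEVELS) - 1))
    return LEVELS[score]


# Trials with a serious quality limitation and important inconsistency.
print(grade_quality("randomized trial",
                    downgrades={"study quality": 1, "inconsistency": 1}))  # low
# Observational studies showing a very strong association.
print(grade_quality("observational study",
                    upgrades={"strength of association": 2}))  # high
```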

Table 1 

Criteria for assigning grade of evidence for questions involving interventions 

Type of evidence 

• Randomized trial = high 

• Observational study = low 

• Any other evidence = very low 

Decrease grade if 

• Serious (-1) or very serious (-2) limitation to study quality 

• Important inconsistency (-1) 

• Some (-1) or major (-2) uncertainty about directness 

• Imprecise or sparse data (-1) 

• High probability of reporting bias (-1) 

Increase grade if 

• Strong evidence of association: significant relative risk of >2 (<0.5) based on consistent evidence 
from two or more observational studies, with no plausible confounders (+1) 

• Very strong evidence of association: significant relative risk of >5 (<0.2) based on direct evidence 
with no major threats to validity (+2) 

• Evidence of a dose-response gradient (+1) 

• All plausible confounders would have reduced the effect (+1) 


As one can see from Table 1, the level for the quality of the 
evidence can be “high,” “moderate,” “low,” or “very low.” For a 
question of intervention, the quality grade for an aggregate of 
randomized controlled trials would start at the entry level of 
“high”; a collection of observational studies would start at the 
entry level of “low” and evidence from studies of other designs at 
“very low.” Subsequently, the level for the quality of evidence for a 
particular outcome is reduced or downgraded, if there are limitations 
to the methodological quality of the studies, inconsistencies 
between studies, limitations of the directness of the evidence (i.e., 
the evidence does not apply directly to the populations, interventions, 
or outcomes of interest), imprecise or sparse data, or a high 
probability of reporting bias. On the other hand, the level of evidence 
of observational studies would be raised or upgraded if there 
is evidence of a strong or very strong association between the 
intervention and the outcome, if a dose-response gradient exists, 
or if all plausible confounders would have reduced the observed effect. 
The final grade for the 
quality of the evidence cannot move higher than to “high” level, or 
lower than to “very low” level. More information on the GRADE 
system can be obtained at http://www.gradeworkinggroup.org. 

2.5.3 Report the Rationale for the Guideline Statement and the Applied Grade 

The scientific basis on which the guidelines were developed should 
be clearly stated including the strength and consistency of the 
evidence. Evidence tables can be helpful particularly when multiple 
studies are used to generate the guideline statement. In cases in 
which the evidence is lacking or poor, any uncertainty or disagreement 
amongst committee members should be documented. 

2.6 Address Transparency and Conflict of Interest 

In a cross-sectional survey of authors of 44 CPG developed for 
common adult diseases and published between 1991 and July 
1999, 87 % of authors reported some form of interaction with the 
pharmaceutical industry [8]. On average, CPG authors interacted 
with 10.5 different companies and 59 % had relationships with 
companies whose drugs were considered in the guideline they 
authored. Fifty-five percent of the respondents also indicated that 
the guideline process with which they were involved had no formal 
process for declaring these relationships. In another recent report 
on more than 200 guidelines from various countries, it was found 
that more than one-third of the authors declared financial links to 
relevant pharmaceutical companies [9]. These links included 
research grant support, personal compensation for lectures or 
advice, or even stock holdings and patents. 

It also appears that the majority of consensus and guideline 
development processes are supported either directly or indirectly 
(through “unrestricted” grants to medical specialty societies or 
national disease associations) by pharmaceutical companies with 
vested interests. For example, Amgen (maker of erythropoiesis-stimulating 
agents) and DaVita (a US company that provides dialysis 
services) have been criticized for their close relations and sponsorship 
of anemia management guidelines developed by the National 
Kidney Foundation. It is not surprising, therefore, that guideline 
committees have been criticized for effectively increasing the sales 
and profits of related companies [10]. 

Given that any perceived influences lessen the credibility and 
undermine the impartiality of the CPG process, it is important to 
minimize inappropriate influence. Several approaches have been 
suggested including the following: 

• Full government or societal sponsorship of guideline development 
and implementation with no financial connections to 
companies, partnerships, or individuals who are involved with 
the manufacture, sale, or supply of health technologies in any 
way relevant to the guidelines. Such an approach would also 
restrict committee members to those with no conflicts of interest. 
Supporters of this approach maintain that practice recommendations 
will always be viewed with skepticism unless 
industry ties are completely avoided. Critics of the approach 
point out that the exclusion of authorities or experts who 
have received compensation from companies with a vested 
interest would likely compromise the expertise of the guideline 
committee [11] and perhaps limit the experienced human 
judgment necessary to make decisions easier at the practice 
level. The ability or willingness of governments or national 
associations to create independent and financially secure committees 
is also questionable. 

• Avoidance of a single corporate sponsor, as suggested by 
Narins and Bennett [12]. Industry contributions can be placed 
in a common pool for the development of all guidelines, not 
just those related to the interests of the contributor. 

• A process to minimize conflicts and optimize transparency 
while ensuring the Chair is free of conflict. The National 
Institute for Clinical Excellence in the United Kingdom (www.nice.org.uk) 
requires its members to declare financial and other 
interests and if identified, “the individuals are required to stand 
down and not take part in the relevant decision-making process 
for that project.” Although such public disclosure may 
heighten readers’ skepticism, it does not release authors from 
their potentially compromising ties to industry. 

• Policies for individuals with inappropriate influence to stand 
down from relevant discussions or voting. 

• Disclosure statements of sponsorship details for the medical 
specialty societies or national associations (not just for the 
CPG process). 

• Procedures to ensure that the selection of committee chairs and 
committee members is transparent. 

• Rules to ensure potential conflicts of interest are transparent. 


• Regulations for members to recuse themselves from critically 
evaluating their own work that might serve as the basis for 
a recommendation, especially when that recommendation 
could have economic implications [13]. 


2.7 Develop an Implementation Strategy 


Dissemination and implementation strategies are critical for the 
success of any CPG. Unfortunately, CPG do not implement 
themselves [14], and simple dissemination does not produce change. 
It can be very helpful to develop a committee with expertise in medical 
education and behavior change. Given that the behavior of physicians 
and other health care providers is difficult to alter, and that adult 
learning is complex and variable, dissemination and 
implementation should involve multiple strategies including several 
or all of the following: 

• Involve users in the CPG development process. 

• Compile short summaries with key messages. These can be 
web based or in the form of brochures or posters or similar 
educational materials. 

• Use professional journals including peer-reviewed journals and 
publications by relevant societies or interest groups. 

• Use the education processes of relevant annual meetings, CME 
events, or universities; discuss the CPG at conferences and 
seminars. 

• Ask respected clinical leaders to promote the CPG including 
the use of academic detailing. 

• Develop tools to utilize the CPG within routine procedures 
such as quality assurance. 

• Develop information technology tools to incorporate the CPG 
within practice-based computer reminder prompts or 
decision-making algorithms. 

• Hire professional communicators. 

• Consider economic incentives including differential fees for 
achieving CPG-specified targets. 

• Consider end-user (i.e., patient) directed advertising. 

No single strategy is effective for all health care providers. To be 
effective, CPG must become embedded in the day-to-day activities 
of the user. 


2.8 Consultation 

Before the guideline document is considered final and widely 
distributed, it should be sent for review to a wider 
group of relevant individuals who did not participate in the 
development of the document. These individuals might include other 
members of the society responsible for the guideline development, 
members of associated societies including allied health groups, 
consumer groups and patient support groups, and health 
authorities and regulatory agencies. Comments received from this 
external review process should be reviewed by the guideline committee 
members and areas of confusion or significant discord should be 
appropriately addressed in the guideline document. 


2.9 Evaluate the Guidelines 

The Appraisal of Guidelines, Research, and Evaluation (AGREE) 
Collaboration (www.agreetrust.org) has created and validated an 
instrument for clinicians to assess and rate the quality of the CPG 
document itself. The assessment of quality involves judging 
whether the potential biases within the guideline development 
process have been addressed adequately and whether the 
recommendations are internally and externally valid. Within the AGREE 
instrument the reader is asked to assess, among other things, the overall 
aim of the guideline, the target population, the degree of stake¬ 
holder involvement, the rigor of the methods used to formulate 
the recommendations, the clarity of the recommendation state¬ 
ments, the organizational, behavioral and cost implications of the 
guidelines, and the editorial independence of the guidelines. 

In addition to the formal evaluation of the CPG document, it 
is important to determine whether the guidelines and the imple¬ 
mentation process have affected the users’ knowledge and behavior 
and whether the guidelines have impacted the desired health 
outcomes. Unfortunately, these steps are challenging and, similar 
to the implementation process, are frequently overlooked. Critical 
components of this process include the following: 

• An assessment of guideline dissemination—this is a relatively 
simple process of counting the number of copies of guidelines 
printed or downloaded, the number of presentations at national 
or local meetings, the number of publications and citations, etc. 

• An assessment of the impact of CPG on user awareness, knowl¬ 
edge, and understanding—specifically designed questionnaires 
directed towards the relevant health care practitioner can be 
helpful for this component. 

• An assessment of whether or not the guidelines have contrib¬ 
uted to changes in clinical practice or health outcomes—ideally, 
analyses of longitudinally collected data should allow for assess¬ 
ment of changes in practice in relation to the guidelines 
(e.g., antihypertensive medication use for blood pressure CPG) 
and an assessment of health outcomes (e.g., stroke mortality). 
Comparisons can be made pre- and post-CPG implementation. 
However, caution is required when interpreting outcomes 
with prolonged lag times (e.g., change in blood pressure 
control and stroke mortality) and when inferring causality in 
pre/post comparative studies. 
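
The pre/post comparison described in the last point can be sketched as a simple two-proportion test. The following Python sketch is illustrative only: the counts are hypothetical, the function name is our own, and a pooled two-sided z-test is just one of several reasonable choices. A statistically significant difference in such a comparison does not by itself show that the CPG caused the change.

```python
from math import erf, sqrt


def two_proportion_z(x_pre, n_pre, x_post, n_post):
    """Pooled two-sided z-test for a difference between two proportions.

    x_* are the numbers of patients meeting the practice target
    (e.g., receiving antihypertensive medication); n_* are the
    numbers of patients audited in each period.
    """
    p_pre = x_pre / n_pre
    p_post = x_post / n_post
    pooled = (x_pre + x_post) / (n_pre + n_post)
    se = sqrt(pooled * (1 - pooled) * (1 / n_pre + 1 / n_post))
    z = (p_post - p_pre) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_pre, p_post, z, p_value


# Hypothetical audit counts, invented for illustration only.
pre, post, z, p = two_proportion_z(x_pre=420, n_pre=1000,
                                   x_post=480, n_post=1000)
print(f"pre={pre:.2f} post={post:.2f} z={z:.2f} p={p:.4f}")
```

Because secular trends and other concurrent changes can produce the same difference, designs with several measurements before and after implementation are generally easier to interpret than a single pre/post contrast.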


2.10 Assess the Need to Revise the Guidelines 


CPG lose value to clinicians with time as new evidence develops. As 
a result, some guidelines set an arbitrary revision date. In a slowly 
evolving field, the revision date may be premature leading to wasted 
time and money on guideline revision. Conversely, in a rapidly evolv¬ 
ing field, CPG may become outdated before their scheduled revision 
date. Shekelle et al. [15] developed criteria for when a guideline 
needs updating and in a review of 17 CPG published by the US 
Agency for Healthcare Research and Quality, they found that half 
the guidelines were obsolete by 5.8 years. As a result, the authors 
suggested that CPG be assessed for validity every 3 years. Given 
this relatively brief period, it may be worthwhile to consider CPG 
development as an ongoing process rather than a discrete event. 
The Canadian Hypertension Education Program, for example, per¬ 
forms automatic literature searches and reviews new evidence on an 
annual basis [16]. Areas where the guidelines are deemed invalid are 
then updated expeditiously. For this process to work, the majority of 
the guideline work group members remain within the same guideline 
section on a year-to-year basis. Of course, this annual revision is 
not applicable to a field in which new evidence evolves slowly. 
Even in that setting, however, the validity of published guidelines 
should be assessed on a regular basis. 


2.11 Recognize the Limitations of the Guidelines and the Guideline Development Process 


CPG development is a time-consuming and complex process. 
The cost of CPG development is also an issue, particularly if the 
CPG are developed with rigor as suggested above. Subjectivity is 
inevitable despite the best efforts of committee members to mini¬ 
mize bias and follow published grading schemes. It is also worth 
emphasizing that medicine is an evolving field. At the time of pub¬ 
lication, CPG may be partially outdated particularly in fields that 
are evolving rapidly. Finally, in many circumstances, high quality 
valid evidence simply may not exist. Development of CPG in these 
situations is challenging and sometimes impossible. 


3 Legal Considerations 


3.1 Liability of Guideline Developers and Societies Supporting Guideline Development 


Guideline developers should demonstrate that they have taken the 
necessary steps to ensure proper preparation of the guidelines. 
These steps are detailed above. It is critical to be transparent about 
evidence retrieval and synthesis of this evidence into guideline 
statements. It is also important to state that CPG are not definitive 
statements and are not meant to replace sound medical 
decision-making which takes into account critical elements rele¬ 
vant to patient care including patient preferences and clinician 
experience. The guideline document should also explicitly state 
the dates of development and the date of final acceptance by the 
sponsoring body. In this way, the recommendations made within 
the document are legally correct only to that date. If these steps are 
taken, it is very unlikely any legal liability will befall the guideline 
developers or sponsoring society. 

3.2 Liability of Practitioners 


It is beyond the scope of this document to review the legal liabilities 
of practitioners if CPG are not followed. Briefly, guidelines have 
been produced as evidence of what constitutes appropriate care 
delivered to a patient. However, other factors, including behavior 
in comparison to a standard of peers and provision of information 
around potential risks and benefits, are often considered for actions 
judged negligent. 


4 Conclusion 

The development of CPG is a complex, time-consuming process 
involving systematic collection and synthesis of evidence, 
transparency, bias minimization, and detailed implementation and 
evaluation strategies. If developed correctly, CPG can facilitate treatment 
of patients to maximize benefits, reduce harm, and ultimately 
improve patient-relevant health outcomes. 


Evidence-Based Decision-Making 5: Translational Research 

Deborah M. Gregory and Laurie K. Twells 

Abstract 

The delay in turning research into practice for the benefit of patient care has been compared to a “leaky 
pipeline.” In the early 2000s, this delay raised concerns among governmental agencies and other sponsors 
of health services in many countries. Facilitating the translation of basic and clinical research into clinical 
practice through evidence-based decision-making and improving population health is now a major goal of 
health research investment agencies. Translational research or knowledge translation has emerged to 
bridge the gaps between basic and clinical research, and between clinical research and clinical practice. 

Various frameworks and definitions of translational research are presented. We present an example of 
an Integrated Knowledge Translation Team in Bariatric Care, and explain how an integrated knowledge 
translation (iKT) approach was created at the program’s inception. This led to evidence-based decision¬ 
making and subsequent practice change in one area of the health care system. Real-world successes and 
challenges in moving research to practice are discussed. 

Key words Translational research, Translational research frameworks, Integrated knowledge 
translation 


1 Introduction 


It has been frequently stated that it takes 17 years to turn research 
into practice for the benefit of patient care [1]. The inability to 
apply research findings has sometimes been compared to a “leaky 
pipeline” or funnel between research (scientists) and practice (policy 
makers and practitioners) [1, 2]. In the early 2000s, the gap 
between research and practice, characterized as a “chasm” by the 
Institute of Medicine [3], raised concerns among governmental 
agencies and other sponsors of health services in many countries 
including the USA, UK, and Canada [2]. 

Evidence-based approaches emerged in response to the need 
to improve the quality of health care and to close the gap between 
research and practice [3]. Evidence-based practice, defined as 
“the integration of best research evidence with clinical expertise 
and patient values” [4], and evidence-based decision-making, 
defined as “the formalized process of using the skills for identifying, 
searching for, and interpreting the results of the best scientific 
evidence, which is considered in conjunction with the clinician’s 
experience and judgment, the patient’s preferences and values, 
and the clinical/patient circumstances when making patient care 
decisions” [5] have been utilized by a number of disciplines includ¬ 
ing medicine, nursing, and psychology. 

The importance of translation of research knowledge to effective 
clinical treatment is well known [6] and considered essential to the 
public good [7]. Facilitating the translation of basic and clinical 
research into clinical practice and everyday decision-making to 
improve population health is a priority of health research invest¬ 
ment agencies such as the Canadian Institutes of Health Research 
(CIHR) [8], the National Institutes of Health [9] and the Agency 
for Healthcare Research and Quality [10] among others in the 
USA, and the Medical Research Council [11, 12] of the UK. It has 
become increasingly important to demonstrate that the money 
invested in health research has moved research into practice 
and policy. Simply providing research evidence at scientific confer¬ 
ences or meetings and through publications in scholarly journals, 
while important, is not perceived as being adequate to ensure 
appropriate knowledge use in decision-making. Brownson et al. 
[13] stated “...too often, discovery of new knowledge begets more 
discovery (the next study) with little attention on how to apply 
research advances in real-world public health, social service, and 
health care settings.” 

The process of transferring research evidence from basic science 
into clinical research and from clinical research into clinical practice 
has been variously termed translational research, knowledge 
translation, knowledge transfer and exchange, knowledge to 
action, research utilization, and dissemination and implementation 
research. McKibbon et al. [14] identified as many as 100 terms used 
to describe the process of putting knowledge into action. Although 
the terminology can be confusing, the underlying rationale for 
each is comparable and focuses on bridging the gap between 
research and practice. 


2 What Is Translational Research? 

The term “translational research” appeared in the literature as early 
as 1993, but there was limited reference to the term during the 
1990s [15]. It has been referred to as the “new buzzword” in the 
health care research field [16]. In a commentary published in 
2008, Woolf stated “translational research means different things to 
different people, but it seems important to almost everyone” [17]. 
The author suggested that for many it referred to the knowledge 
generated from “bench to bedside” or from basic science to 
clinical medicine. For health services and public health researchers 
it referred to “translating research into practice; i.e., ensuring that 
new treatments and research knowledge actually reach the patients 
or populations for whom they are intended and are implemented 
correctly” [17]. Although translational research has not been 
clearly defined [15], numerous attempts have been made to do so 
by various fields [18] including medicine, nursing, and 
psychology. 


2.1 Translational Research Frameworks 

Various translational research and knowledge translation frame¬ 
works and definitions can be found in the literature. In this section, 
we present an overview of the frameworks referred to as transla¬ 
tional research and in the following section we specifically focus on 
the framework we use to guide our translational research program. 
This knowledge translation framework is utilized by Canada’s fed¬ 
eral health research agency—the Canadian Institutes of Health 
Research. 

Translational research has been referred to as a process of taking 
findings from basic research or clinical research and using them to 
produce innovation in healthcare settings; the term is also used to define 
research that involves both basic and applied research [11]. Thus 
there are at least two levels of translational research. The first level (T1) 
was defined as “The transfer of new understanding of disease 
mechanisms gained in the laboratory into the development of new 
methods for diagnosis, therapy, and prevention and their first testing in 
humans” [17]. The second level (T2) was described as “Translation 
of results from clinical studies into everyday clinical practice and 
health decision making” [17]. Fiscella et al. [19] have referred to 
T1 research as preclinical and further subdivided it into short-term 
(lasting up to 5 years) and long-term (lasting from 5 to 10 years). 
The second level was defined as applied clinical research that is 
clinician-focused, patient-focused, and community-focused. 

Lean et al. [20] suggest three phases of translational research. 
Phase 1 is “from bench to bedside,” phase 2 “examines how find¬ 
ings from clinical science function when they are applied routinely 
in practice,” and phase 3 “incorporates research processes to evalu¬ 
ate the complex interacting environmental and policy measures 
that affect...sustainability of clinical and public health strategies” 
[20]. Phases 2 and 3 equate to level 2 proposed by Woolf [17]. 

Westfall et al. [18] proposed further dividing the second phase 
of translational research, defining T2 as research to develop 
evidence-based recommendations and policies and T3 as research 
on implementing and disseminating evidence-based interventions 
in practice [18]. In a commentary by Dougherty and Conway in 
2008, an extension of the translational framework was suggested 
to include quality improvement research to evaluate how to deliver 
high-quality care consistently and effectively [21]. 

In 2007, Khoury et al. [22] presented a framework for the 
continuum of multidisciplinary translation research which revolved 
around the development of evidence-based guidelines. The authors 
presented four phases of translational research: T1: discovery to 
candidate application; T2: health application to evidence-based 
guidelines; T3: evidence-based practice guidelines to health practice; 
T4: practice to population health. The model depicts a logical 
progression from T1 to T4 research, but the process is not necessarily 
linear. Agurs-Collins et al. later adapted the translational research 
framework to obesity genomics research and emphasized the 
central role that knowledge synthesis plays in translational research 
[23]. Khoury et al. [24] coined the term “translational epidemiology” 
to highlight the role of epidemiology in translating scientific 
discoveries into population health impact. Using human genomics 
as an example, Khoury suggested that epidemiology has a role to 
play in each of four phases of translational research. In T1, 
epidemiology explores the role of a basic discovery (e.g., a disease factor 
or biomarker) in developing a candidate application for use in prac¬ 
tice (e.g., a test to guide interventions). An example from genom¬ 
ics would involve assessing the prevalence, associations, 
interactions, sensitivity, specificity, and predictive value of testing 
for genetic risk factors. In T2, epidemiology can help to evaluate 
the efficacy of the candidate application by using observational or 
experimental studies. This would involve assessing the clinical util¬ 
ity of genetic risk factors in improving health outcomes. In T3, 
epidemiology can help to assess facilitators and barriers for uptake 
and implementation of candidate application in practice, for 
example, assessing the factors associated with implementation of 
BRCA testing in practice. In T4, epidemiology can help to assess 
the impact of using candidate applications on population health 
outcomes, for example assessing the effectiveness of newborn 
screening programs. Epidemiology also has a leading role in knowl¬ 
edge synthesis, especially using quantitative methods (e.g., meta¬ 
analysis) [24]. 

Translational research is a process that promotes the multidirectional 
and multidisciplinary integration of basic research, patient-oriented 
research, and population-based research, with the long-term aim of 
improving the health of the public [15]. 


3 Translation of Research in Canada 

Canada’s national health research investment agency, the Canadian 
Institutes of Health Research (CIHR), was created in 2000 under 
the authority of the CIHR Act [25]. It consists of four pillars of 
health research: biomedical, clinical, health systems, and population 
health. Translation of research is embedded in its mandate. A key 
focus of the agency is “knowledge translation that facilitates the 
application of the results of research and their transformation into 
new policies, practices, procedures, products and services” [8]. To 
promote the movement of research into practice, researchers must 
incorporate a knowledge translation plan in their funding proposals 
and dedicate funds for the same. Additionally, knowledge users 
such as decision-makers must demonstrate the use of evidence in 
planning and priority setting. 

CIHR defines knowledge translation as “a dynamic and itera¬ 
tive process that includes the synthesis, dissemination, exchange, 
and ethically-sound application of knowledge to improve the health 
of Canadians, provide more effective health services and products, 
and strengthen the healthcare system.” It involves “interactions 
between researchers and knowledge users that may vary in intensity, 
complexity and level of engagement depending on the nature of the 
research and the findings as well as the needs of the particular 
knowledge user.” (http://www.cihr-irsc.gc.ca/e/39033.html). 

Knowledge translation (KT) at CIHR is described by two 
categories: end-of-grant knowledge translation and integrated 
knowledge translation (iKT). The first involves initiatives 
undertaken once the research project has been completed, and the second 
is integrated into the research process 
(http://www.cihr-irsc.gc.ca/e/39033.html). In end-of-grant KT, the researcher develops 
and implements a plan for making knowledge users aware of the 
knowledge that was gained during a research project. Therefore, 
end-of-grant KT includes the typical dissemination and 
communication activities undertaken by most researchers, such as KT to 
their peers through conference presentations and publications in 
peer-reviewed journals. End-of-grant KT can also involve more 
intensive dissemination activities that tailor the message and 
medium to a specific audience, such as summary briefings to 
stakeholders, interactive educational sessions with patients, practitioners 
and/or policy makers, media engagement, or the use of knowledge 
brokers (http://www.cihr-irsc.gc.ca/e/39033.html). 

According to the CIHR, “The term integrated KT describes a 
different way of doing research with researchers and research users 
working together to shape the research process starting with col¬ 
laboration on setting the research questions, deciding the method¬ 
ology, being involved in data collection and tool development, 
interpreting the findings and helping disseminate the research 
results. This approach also known as collaborative research, 
action-oriented research, and co-production of knowledge, should 
produce research findings that are more likely to be relevant to and 
used by the end-users” (http://www.cihr-irsc.gc.ca/e/39033.html). 
It is more likely this process will result in knowledge users 
such as policy and decision-makers, clinicians or the public using 
the results in everyday decision-making. 

CIHR has adopted Graham and colleagues’ Knowledge to Action 
Cycle for promoting the application of research to ensure that new 
knowledge generates action to improve health or health care 
services, and as a framework for the process of knowledge translation 
(http://www.cihr-irsc.gc.ca/e/39033.html). The “Knowledge to 
Action Cycle” requires identifying the problem and selecting the 
relevant knowledge; adapting the knowledge to the local context; 
assessing the determinants of knowledge use (barriers and sup¬ 
ports); selecting, tailoring, implementing, and monitoring knowl¬ 
edge translation interventions; evaluating outcomes or impact of 
knowledge use; and determining strategies for ensuring sustained 
knowledge use. 


4 The Newfoundland and Labrador Integrated Knowledge Translation Team 
in Bariatric Care: A Case Study 


4.1 Context 


In Newfoundland and Labrador (NL), one in every three adults is 
obese (BMI >30), the highest rate of obesity in Canada. Within 
this group, the prevalence of individuals who are excessively obese, 
classified as either class II (BMI >35) or class III obese (BMI >40), 
is high and continued increases are projected [26]. These excessive 
weight categories increase the risk of developing chronic 
conditions such as hypertension, type 2 diabetes, and cardiovascular 
disease, impair quality of life, place a substantial burden on the health 
system, and put individuals at a higher risk of premature mortality 
[26-30]. In Canada, it is estimated that obesity costs the health 
care system between $3.9 and $4.3 billion in direct and 
indirect medical costs [31, 32]. 


4.1.1 Bariatric Surgery in Newfoundland and Labrador 


Bariatric or weight loss surgery is recommended as a medically 
effective treatment for class II (BMI >35 kg/m² + comorbid condition) 
and class III obesity (BMI >40 kg/m²), herein referred to as morbid 
obesity, for individuals who demonstrate unsuccessful weight loss 
attempts [27]. It is considered superior as a treatment for morbid 
obesity when compared to any other intervention (e.g., lifestyle, 
medical management, behavioral, pharmacologic), resulting in 
substantial and sustainable weight loss, improved quality of life, and 
reduced likelihood of premature mortality [33-37]. 
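
The BMI thresholds quoted above can be expressed as a small classification routine. The following Python sketch is illustrative only and is not the screening procedure used by the program: the function names are hypothetical, the “class I” label and the inclusive lower bounds (≥) follow the common WHO convention rather than the “>” written in the text, and the requirement of prior unsuccessful weight loss attempts is not modelled.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def obesity_class(bmi_value):
    """Map a BMI value to an obesity class (assumed WHO-style cut-offs)."""
    if bmi_value >= 40:
        return "class III"
    if bmi_value >= 35:
        return "class II"
    if bmi_value >= 30:
        return "class I"
    return "not obese"


def meets_bmi_criterion(bmi_value, has_comorbidity):
    """BMI part of the surgical indication described in the text:
    class III, or class II with a comorbid condition."""
    return bmi_value >= 40 or (bmi_value >= 35 and has_comorbidity)


# Hypothetical patient, for illustration only.
value = bmi(weight_kg=120, height_m=1.70)
print(round(value, 1), obesity_class(value),
      meets_bmi_criterion(value, has_comorbidity=False))
```

Keeping the cut-offs in one place makes it easy to see how sensitive the eligible population is to the choice of threshold.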

In May 2011, Eastern Health (EH), the largest of four integrated 
regional health boards in the province of Newfoundland 
and Labrador, started offering laparoscopic sleeve gastrectomy 
(LSG), a type of bariatric surgery, to eligible patients from the 
entire province (population 510,000), with an estimated 100-150 
surgeries annually. LSG is a non-reversible procedure resulting in the 
removal of approximately 80% of the stomach, leaving a much 
smaller stomach or “sleeve” [27, 38]. LSG is a relatively new type 
of bariatric surgery, until recently considered “investigational,” but 
now a stand-alone procedure that accounts for almost 20% of all 
bariatric surgeries in North America [38, 39]. 


4.1.2 Establishment of a Multidisciplinary Clinical Team in Bariatric Surgery 


According to Canadian Clinical Practice Guidelines, successful 
treatment for morbid obesity is more likely when multidisciplinary 
health care providers are involved in patient care pre- and 
post-surgery [27]. Consequently, a multidisciplinary clinical team was 
established at EH, comprising three general surgeons 
trained in bariatric surgery, a nurse practitioner (NP), and a 
dietician, with access to other medical specialties via consultation. 
Potential patients referred by their primary health care provider via 
a standardized referral form are screened by the NP. If eligible, 
patients are invited to attend an educational session presented by 
the NP and dietician on topics that include an overview of obesity 
and associated comorbidities, program and patient weight loss 
expectations, and a review of surgery including risks and dietary 
teachings. If interested, patients meet with the NP and dietician for 
further assessment and teaching. If deemed an eligible surgery 
candidate, patients are booked to meet with the surgeon for official 
consent. Once surgery has been performed, patients are followed 
up at 4-6 weeks, at 3, 6, 12, 18, and 24 months, and 
annually thereafter. Patients experiencing challenges are offered 
follow-up more frequently and have access to the NP and dietician 
through e-mail and phone contact. 


4.2 Opportunity for Research: A Window of Opportunity 


In a joint report “Developing a Research Agenda to Support 
Bariatric Care” published by the CIHR and the Canadian Obesity 
Network in 2010, the gaps in research related to bariatric surgery 
were highlighted [40]. Although LSG was not specifically 
highlighted in the report, research on LSG as a treatment for morbid 
obesity is limited compared to other bariatric surgeries because of its 
relatively recent advent. No LSGs were performed in Canada and the 
USA in 2003, compared with 19,486 in 2011 [39]. There is 
limited research on LSG in Canada and elsewhere on (1) 
experiences of patients who choose bariatric surgery as a modality for 
treatment of their morbid/clinical obesity, (2) patient expectations 
of weight loss as a result of undergoing surgery, (3) mid- to 
long-term health outcomes (3-5 years and beyond), and (4) 
patient-reported health outcomes post-surgery. Although there is an 
increasing body of research reporting on the short to intermediate 
time period post-surgery (2-5 years), long-term (>5 years) data are 
limited [41, 42]. The start-up of a bariatric surgery program 
offering LSG as a treatment for morbid obesity, combined with the 
limited research on LSG, provided an opportunity for research. 


4.2.1 Development of an Integrated Knowledge Translation Team in Bariatric Care 


Initial contact with academic researchers was initiated by one of 
the surgeons trained in bariatric surgery (DP). As academic 
researchers trained in the areas of evidence-based medicine and 
clinical epidemiology, our approach to working together was based 
on the belief that research knowledge is more likely to be used by 
knowledge users (e.g., health professionals, decision-makers, 
policy makers) if they are engaged early in the process and establish 
an ongoing relationship with the researchers. With the goal of 
establishing a sustainable, long-term program of research that 
would provide relevant information to knowledge users, we 
decided to use the CIHR iKT approach 
(http://www.cihr-irsc.gc.ca/e/39033.html) as a guide in establishing our team. 

Consequently an Integrated Knowledge Translation Team was 
established at Memorial University to develop a program of research 
focused on bariatric care. This team comprises academic 
researchers at Memorial University, health care professionals and 
decision-makers from the Surgical Program at Eastern Health, and 
policy makers at the Department of Health and Community Services 
in the provincial government. In addition, it includes database man¬ 
agement experts from Eastern Health and data linkage specialists 
from the NL Centre for Health Information as well as a number of 
trainees. A partnership has been developed with researchers from 
other provinces, as well as national knowledge translation experts 
from the Canadian Obesity Network. Extensive consultation with 
local stakeholders took place. In addition, we applied for and received 
a CIHR meeting, planning, and dissemination grant [43] to bring 
experts in the field of bariatric care to Memorial University and 
Eastern Health in order to advise on future research directions. 
Subsequently our translational research in bariatric care focused on 
capturing the patient’s total experience of waiting for, undergoing, 
recovering from, and adjusting to life after bariatric surgery. 
The program also assesses not only the clinical outcomes post¬ 
surgery, but patient-reported outcomes (perceptions of physical, 
emotional, and psychosocial health and well-being) and how these 
relate to the overall success of the surgical intervention. This research 
process is dynamic and iterative and includes: 

• Determining research capabilities. 

• Identifying gaps in the research literature or those relevant to 
the health system. 

• Deciding on research questions. 

• Writing research grants/obtaining funding. 

• Conducting research. 

• Translating research findings to knowledge users. 

• Evaluating interventions undertaken as a result of the 
evidence provided. 

Our vision is to promote the utilization of integrated KT in the 
development of a more evidence-informed practice in the area of 
bariatric care with the goal of improving patient health outcomes 
and enhancing population health through improved treatment 
options for morbid obesity in the population. 


Evidence-Based Decision-Making 5: Translational Research 




4.2.2 Research Objectives 

Our program of research covers the continuum of bariatric care. 
Specific research objectives include (1) examining the waiting 
period for surgery and patients’ goals and weight loss expectations 
post-surgery, (2) examining health outcomes post-surgery such as weight 
loss/weight regain, resolution/regression of comorbid conditions, 
changes in quality of life, and the impact on the health system, (3) 
examining patients’ perceptions and definitions of success after surgery, (4) 
developing and implementing interventions, and (5) evaluating 
interventions and applying new knowledge. Future research will 
focus on the development of interventions, including an evaluation 
process that is both formative and summative in order to improve 
patient outcomes and the effectiveness of health services delivery 
in bariatric care, and to ensure value for money within the 
publicly funded health care system. 

4.2.3 Translational Research in Action 

Since its inception in January 2011, our translational team has 
been engaged in five research projects. Four of the projects have been 
completed to date: a qualitative study on patients’ experiences with 
waiting for bariatric surgery, a qualitative study on patients’ perceptions 
of their health and well-being following surgery and their definitions 
of success, a quantitative study on patients’ goals and 
weight loss expectations post-surgery, and a quantitative study 
projecting future obesity rates in NL [26, 44-46]. To 
accelerate the uptake of applicable study findings into practice, early 
communication was established through regular team meetings, 
ongoing e-mail correspondence, and the development of a 
SharePoint database. The information flow between researchers 
and knowledge users is bidirectional: some information moves 
from research to the clinical arena, while other information moves from the 
clinical arena to research (see examples in Table 1). We describe 
one example of how clinical practice changed as a result of information 
sharing between the researchers and knowledge users. 
During a meeting of the integrated research and clinical team, the 
clinical team observed that over time patients tended 
not to attend follow-up appointments. The research evidence on 
patient compliance with follow-up suggests that patients who do 
well post-operatively and sustain weight loss are those who continue 
with long-term follow-up and receive support from a multidisciplinary 
team, compared with those who do not comply with regular 
follow-up appointments. Although surgery results in short-term 
weight loss in the majority of patients, attendance at regularly 
scheduled follow-up appointments may further increase long-term 
effectiveness. The findings of the qualitative studies undertaken by 
the team’s researchers with patients before and after surgery 
supported this clinical observation and also helped explain the 
problem: patients were not adhering to the 
assigned schedule of follow-up visits for several reasons including 
prohibitive costs associated with travel, food, and lodgings, and the 
perception that follow-up could be completed within their own 
health region. The integrated knowledge translation team recognized 
that ensuring patient follow-up was critical to the 
success of the program and the research objectives. As a result, 
potential interventions to promote follow-up visits were explored 
and identified. One intervention implemented by the provincial 
bariatric surgery program’s NP was the introduction of TeleHealth 
in a number of locations throughout the province to facilitate 
follow-up visits, allowing patients to stay in their own 
health regions and reducing out-of-pocket travel costs. This 
intervention will be formally evaluated for its effectiveness in promoting 
compliance with follow-up; however, early indications are 
that the number of patients lost to follow-up has 
decreased. 

Table 1 
Integrated knowledge translation activities 

Issue/finding | Communication | Outcome/intervention 
Over time patients not attending follow-up appointments; ensuring patient follow-up critical to ensuring success of program and research objectives | Early identification by clinical team in team meetings | Introduction of TeleHealth to facilitate follow-up visits for patients; early evaluation: reducing loss to follow-up 
Unrealistic post-surgery weight loss expectations (refs. 44, 46) | Finding in qualitative and quantitative studies on waiting for bariatric surgery | Realistic weight loss expectations post-surgery emphasized by NP in formal education sessions 
Weight loss appears to be variable; some patients doing really well, others regaining much of the weight lost | Identified by clinical team during appointments and by research team during data analysis | Initiated study on changes in ghrelin, a gut hormone that induces hunger, pre- and post-surgery 
Ensure sustainability of program funding; ensure sustainability of research program | Joint concern of the research and clinical teams resulting in an integrated effort to provide evidence to decision-makers and policy makers on the effectiveness of the program | Research funding obtained to develop clinical database to house all program data in order to produce report cards for all knowledge users; linkage with the NL Centre for Health Information to determine long-term health outcomes 
Research findings take time to be published | Joint concern of research and clinical teams | KT opportunities with decision-makers lead to fast decision-making 

4.2.4 
to Date 


A number of factors or enablers help support the success of an 
integrated knowledge translation team [47]. These enablers 
include but are not limited to: 

• A receptive environment. 

• Adequate tools and resources (e.g., IT infrastructure, research 
staff). 

• Formal recognition and rewards. 

• Developing processes for timely, relevant information (e.g., 
regular meetings to disseminate and discuss findings before 
being presented at local, national and international 
conferences). 

• Building the capacity of decision-makers to better use the 
research findings (e.g., creation of the clinical database and 
report cards). 

• Access to and regular communication with knowledge users 
and policy makers. 

• A multidisciplinary team of academic/clinical experts. 

• Training and mentoring of graduate students and trainees. 

Just as there are enablers of a successful team, there are challenges, 
and these include [48]: 

• Becoming a team member. 

• Understanding and accepting different agendas and timeframes. 

• Building trust. 

• Sharing of power and authority. 

• Respecting the viewpoints of others. 

• Being FLEXIBLE and accommodating unexpected events. 

• Working on solutions to issues that emerge requires time and 
effort to ensure sustainability. 

• Changing team composition. 

Our Experience 

Some of these enablers are out of one’s control, but others may be 
under one’s influence. For example, our team is fortunate to be part 
of a receptive environment that sees the value in evidence-based 
decision-making [48]. As well, we have access to our knowledge 
users and policy makers, which is more likely in smaller populations 
or geographical regions like ours. We have made communication 
with our team a priority. Keeping the lines 
of communication open between researchers, clinicians, decision-makers, 
and policy makers is the responsibility of the lead investigator, 
who serves as the primary contact. 
In addition, regular team meetings, ongoing informal interactions 
with all stakeholders, and early and ongoing sharing of research 
findings with the entire team support the continued engagement of team 
members. The biggest challenge we have faced thus far is changing 
team composition (e.g., key knowledge users moving to different 
and unrelated positions). 


5 Concluding Remarks 

For integrated knowledge translation to be successful, researchers 
must engage and integrate potential knowledge users (researchers 
from different disciplines, decision-makers, policy makers, clinicians) 
in the research process from the start, develop a collaborative 
approach to research that is action-oriented and impact-focused, 
and be part of a receptive environment that supports 
evidence-based practice and policy [47]. 
Evidence-Based Decision-Making 6: Utilization 
of Administrative Databases for Health Services Research 

Tanvir Turin Chowdhury and Brenda Hemmelgarn 

Abstract 

Health-care systems require reliable information on which to base health-care planning and make decisions, 
as well as to evaluate their policy impact. Administrative data provide important information about health 
services use, expenditures, clinical outcomes, and may be used to assess quality of care. With increased 
digitalization and accessibility of administrative databases, these data are more readily available for health 
service research purposes, aiding evidence-based decision-making. This chapter discusses the utility of 
administrative data for population-based studies of health and health care. 

Key words Administrative databases, Health service research, Population-based studies 


1 What Are Administrative Data? 

In general terms, a database is any compilation of information on 
characteristics and events stored in an organized manner, which 
can be used to analyze and answer a specific question [1-3]. 
Administrative data are collected by governments or specific 
programs for purposes other than research but can be used for research 
purposes. Examples of data recorded by administrative 
systems include vital statistics records, census data, workers’ compensation 
records, and insurance claim records. Administrative health 
databases collect information on individuals registered with health-care 
plans or utilizing health services [4, 5] for purposes that include tracking 
service use, monitoring the quality of health-care delivery, and 
tracking payments and health plan enrollment. Depending upon 
the source, such information may include the characteristics of 
inpatient and outpatient encounters, physicians’ visits, provision of 
home care, stays in chronic and acute care facilities such as nursing 
homes and hospitals [6, 7], or prescriptions. While not produced 
explicitly to examine the health or health care of populations, 
administrative data nevertheless offer important advantages for 
research. Administrative health databases represent large groups of 
the population, sometimes an entire population defined by 
geographic locality (e.g., all persons hospitalized in a province). 
These databases allow information on individuals to be linked 
across different databases, and linkage over time supports 
longitudinal studies over extended 
periods and across various health-care settings. These databases 
already exist in the governmental infrastructure, are relatively 
inexpensive to acquire, and are relatively easy to use. 


2 Potential Sources of Administrative Health Data 

Administrative health data are generally derived from documentation 
of health status monitoring or health-care delivery. Governments 
generally maintain registries of various types, including population 
and disease registries. A common registry, which exists in most 
countries, is the registry of vital events such as births and deaths. 
These can be used to provide a range of demographic statistics, 
often in combination with census or survey data. Other administrative 
data may be available from government programs that 
provide entitlements or benefits (e.g., social security). There are 
also administrative systems that provide details of transactions 
regarding health-care expenditures. Data on hospital visits may 
provide useful information on morbidity for specific diseases. 
Statistics regarding health-care-related costs are generally taken 
from government financial statistics, private health-care providers’ 
records, or health insurance records. Table 1 provides an overview 
of some administrative health databases across regions. 


3 Administrative Health Database Creation 

As health administrative data are collected for purposes other than 
research, there are challenges in converting these data for research 
use while providing accurate and valid estimation of disease and 
risk. Administrative health data in their raw form may not be suitable 
for immediate analysis. Researchers and analysts often need to 
manipulate the data into a more readily analyzable form prior to 
further use. Depending on the source of the data, raw data may 
include specific variables such as patient identifiers, demographics, 
clinical information on diagnosis, comorbidities and prescriptions, 
service utilization, hospital costs, and physician billing data [2, 6]. 
These data are used to derive new variables for more sophisticated 
analyses and evaluations: for instance, an outcome defined by 
a validated algorithm, travel distance to care facilities derived from 
patients’ postal codes [9, 10], therapy adherence assessed from the 
frequency of prescription refills [11], and variation in access to care 
across ethnic groups determined from data on race [7]. 
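
Two of the derived variables mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the cited studies: it assumes postal codes have already been mapped to centroid coordinates (that lookup is not shown), it uses straight-line rather than road distance, and it measures adherence as the proportion of days covered, which is one common convention among several. The function names and example values are hypothetical.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def travel_distance_km(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in km between two coordinates.

    Assumes each postal code was already mapped to a centroid
    latitude/longitude; that lookup file is not shown here.
    """
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def proportion_of_days_covered(fills, start, end):
    """Fraction of days in [start, end] with medication on hand.

    fills: iterable of (dispense_date, days_supply) from prescription claims.
    Overlapping supplies are counted once (no carry-over of early refills).
    """
    covered = set()
    for dispensed, days_supply in fills:
        for offset in range(days_supply):
            day = dispensed.toordinal() + offset
            if start.toordinal() <= day <= end.toordinal():
                covered.add(day)
    return len(covered) / ((end - start).days + 1)


# Illustrative values only: two 30-day fills in a 90-day observation window
fills = [(date(2020, 1, 1), 30), (date(2020, 2, 15), 30)]
pdc = proportion_of_days_covered(fills, date(2020, 1, 1), date(2020, 3, 30))
print(f"Proportion of days covered: {pdc:.2f}")

# Approximate coordinates for a patient's postal code and a clinic
distance = travel_distance_km(51.05, -114.07, 53.55, -113.49)
print(f"Travel distance: {distance:.0f} km")
```

In practice, road distance or travel time from a geographic information system is often preferred over straight-line distance, and adherence definitions should follow the validated approach of the study being replicated.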
Table 1 
Examples of administrative databases (a) 

Country | Database coverage | Type of database | Type of information available 
Canada | Provincial Health Authority Databases | Physicians claims | Date, location of service, diagnostic code, provider specialty, cost 
Canada | Provincial Health Authority Databases | Inpatient encounters | Admission and discharge dates, diagnostic and procedure codes, costs, case-mix group 
Canada | Provincial Health Authority Databases | Ambulatory care | Date, nature and location of service, diagnostic and procedure codes, costs, case-mix group 
Canada | Provincial Health Authority Databases | Medication | Formulary drugs, prescription date, cost, and quantity 
Canada | Provincial Health Authority Databases | Registry | Date of birth, gender, address 
USA | Age 65 and older and younger people with disabilities | Medicare | Diagnoses, procedure codes, costs, length of stay in hospitals, comorbidities, outcomes, ambulatory care, prescription 
USA | American Veterans | Veterans Affairs | Diagnoses, procedure codes, costs, length of stay in hospitals, comorbidities, outcomes, ambulatory care, prescription 
USA | Low income residents | Medicaid | Diagnoses, procedure codes, costs, length of stay in hospitals, comorbidities, outcomes, ambulatory care, prescription 
USA | Regional and privately funded | Kaiser Permanente | Diagnoses, procedure codes, costs, length of stay in hospitals, comorbidities, outcomes, ambulatory care, prescription 
UK | Primary care | General Practice Research Database | Demographics, diagnoses, prescriptions, referrals, smoking status, height, weight, immunizations, laboratory results 
UK | Primary care | The Health Improvement Network | Demographics, diagnoses, prescriptions, referrals, smoking status, height, weight, immunizations, laboratory results, physicians’ notes 
Sweden | National | Swedish National Cause of Death Register | Demographics, underlying cause of death, comorbidities 
Sweden | National | Swedish Hospital Discharge Register | Diagnoses, procedure codes, costs, length of stay in hospitals 
The Netherlands | National | The National Hospital Discharge Register | Patient data, admission and discharge data, diagnoses, surgical procedures, and the medical specialties 
Finland | National | Finnish Hospital Discharge Register | Diagnoses, procedure codes, costs, length of stay in hospitals, comorbidities, outcomes 
Australia | Regional | Western Australian Health Services Research Linked Database | Birth records, midwives’ notifications, cancer registrations, inpatient hospital morbidity, inpatient and public outpatient mental health services data, and death records 
Australia | National | National Hospital Morbidity Database | Diagnoses, procedure codes, costs, length of stay in hospitals, comorbidities, outcomes 
Australia | National | Medicare Australia | Patient data, claims, provider information, service provided, cost 
Australia | National | Pharmaceutical Benefits Scheme | Patient information, prescription, medication description, related cost 
Japan | National | Japanese Diagnosis Procedure Combination Inpatient Database | Patients’ age and sex, diagnoses, procedures, drugs and devices used, lengths of stay, in-hospital mortality 

(a) Modified and extended from Bello et al. [8] 


The overall quality of information contained in administrative 
databases varies widely. For some conditions and procedures that 
are explicitly identifiable, such as stroke and myocardial infarction, 
coding is reasonably good [12, 13]. On the other hand, it is poor 
for conditions that are more nonspecific, such as chronic kidney 
disease [14]. Further, variability in the structural components of the 
data (e.g., number of data fields available for entries) may also 
influence accuracy, as may more complex issues (e.g., filtering of 
data by those doing the coding). The 
coding detail may also be influenced by payment procedures. 
Because health-care providers are paid for specific procedures, 
procedures are generally coded more accurately and completely 
than diagnoses. A procedure with a high remunerative value 
has a greater probability of being coded properly than a less 
remunerative procedure [15]. Imperfect coding is especially 
problematic for chronic conditions. It has been reported that 
hospitals under-code chronic diseases, especially for acutely ill 
patients [4, 16, 17]. 

A key step in disease ascertainment using health administrative 
data is to develop a disease definition that uses the diagnostic and 
treatment codes from the physician billing and hospital discharge 
databases to identify individuals with the disease [14, 18, 19]. The 
combined input of clinical experts and knowledgeable data analysts 
is used to ensure validity of the definition. This process can often 
involve several iterations of a given methodology, and developing a 
single case ascertainment definition can be a sizable research 
project on its own. New case definitions should then be validated 
against chart review or population health surveys, which are considered 
the gold standard. Thus, the accuracy of health administrative data 
depends not only on the quality of the data but also on the explicit 
condition being identified and the validity of the coding algorithm 
in the patient group [12, 14, 18, 20]. 
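
The definition-then-validation sequence described above can be illustrated with a short Python sketch. The case definition shown (one hospitalization or two physician claims within two years carrying an ICD-9-CM 585.x code) is a hypothetical rule chosen only for illustration; it is not the validated algorithm from the cited studies, and the record layout, function names, and data are assumptions.

```python
from datetime import date

CKD_PREFIXES = ("585",)  # ICD-9-CM chronic kidney disease; illustrative choice


def meets_case_definition(claims, prefixes=CKD_PREFIXES,
                          min_physician_claims=2, window_days=730):
    """Hypothetical rule: one hospitalization OR two physician claims in two years.

    claims: list of (service_date, source, diagnosis_code) for one person,
    where source is "hospital" or "physician".
    """
    relevant = [(d, s) for d, s, code in claims if code.startswith(prefixes)]
    if any(source == "hospital" for _, source in relevant):
        return True
    dates = sorted(d for d, source in relevant if source == "physician")
    for i in range(len(dates) - min_physician_claims + 1):
        if (dates[i + min_physician_claims - 1] - dates[i]).days <= window_days:
            return True
    return False


def validation_metrics(algorithm_positive, reference_positive, population):
    """Agreement between the algorithm and a reference standard (sets of IDs).

    Assumes every denominator is non-zero, which holds for the example below.
    """
    tp = len(algorithm_positive & reference_positive)
    fp = len(algorithm_positive - reference_positive)
    fn = len(reference_positive - algorithm_positive)
    tn = len(population) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Fabricated toy records, for illustration only
claims_by_person = {
    1: [(date(2019, 1, 5), "physician", "585.3"),
        (date(2019, 9, 1), "physician", "585.3")],
    2: [(date(2018, 3, 2), "hospital", "585.4")],
    3: [(date(2019, 4, 1), "physician", "585.2")],
    4: [(date(2019, 6, 9), "physician", "401.9")],
    5: [],
}
reference_positive = {1, 2, 3}  # e.g., disease confirmed by chart review
algorithm_positive = {pid for pid, claims in claims_by_person.items()
                      if meets_case_definition(claims)}
metrics = validation_metrics(algorithm_positive, reference_positive,
                             set(claims_by_person))
print(metrics)
```

Tightening or loosening the rule (for example, changing `min_physician_claims` or `window_days`) trades sensitivity against positive predictive value, which is why several iterations are usually needed.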

It is important to note that administrative health databases 
differ across countries, regions, and groups due to varying health 
policies, governance structure, technological facilities, and 
socioeconomic settings [2, 4, 6, 21, 22]. For example, the comprehensiveness 
of information from administrative databases observed in 
Canada is mainly due to the existence of universal health-care 
coverage across all provinces. Available data include measures of 
disease burden, health-care distribution, prevention activities, outcome 
measurements, and assessment of effectiveness of interventions 
[5, 23, 24]. 


4 Using Administrative Data for Research Purposes 

Despite their administrative origins, these data have provided 
important insights into health-care practices. More than four 
decades ago, Wennberg and Gittelsohn [25, 26] used hospital 
discharge data to expose wide variations in rates of expensive medical 
interventions across small geographic areas with ostensibly 
similar populations. In the late 1980s, these and other previously 
unexplained variations (e.g., in hospital mortality rates, also identified 
using administrative data) precipitated an “era of assessment 
and accountability” in American health care [21, 27]. During the 
early 1990s, Gabriel et al. estimated medical and nonmedical costs 
incurred among a population-based prevalence cohort of individuals 
with osteoarthritis, where osteoarthritis status for each individual 
was ascertained from a physician diagnosis variable [28]. 
Administrative data figured prominently in plans to assess the 
effectiveness of outcomes of care rendered in communities. The 
Agency for Health Care Policy and Research stipulated the use of 
large administrative databases to examine the “outcomes, effectiveness, 
and appropriateness” of health care services, with flagship 
administrative data projects initiated by the Patient Outcomes 
Research Teams (PORTs) [29-31]. Also, some European countries, 
notably Denmark, the Netherlands, Sweden, and Finland, have 
placed much greater emphasis on developing national register data 
since the 1980s, to replace national censuses and major surveys. 
For example, Statistics Denmark bases most of its national statistics 
on “register data” that can be linked both longitudinally and 
between registers of different types (e.g., health, education, 
income). In addition, Danish surveys are frequently supplemented 
with register data on income, health, welfare benefits, housing, etc., 
allowing objective information to be compared with 
survey responses. 


5 Data Linkage with Other Data Sources 

Reports in the USA [32], Canada [33], the UK [34], and Australia 
[35] have recommended increasing the use of existing data, such 
as administrative source data and clinical registry data, to provide 
comparative clinical performance data on health services, hospitals, clinical 
units, and clinicians for internal use and to consumers via 
publicly accessible media. Although a limited number of patient 
outcomes, such as in-hospital mortality, complication, and readmission 
rates, are currently available from some administrative data 
sources, using data linkage to obtain data from several different databases 
that pertain to a specific individual or participant is often 
necessary to ensure adequate risk adjustment and to examine a more 
comprehensive range of outcomes for comparison. 

Data or record linkage has been defined as “a process of 
pairing records from two files and trying to select the pairs that 
belong to the same entity” [36]. In the UK, 47% of multicenter 
clinical databases surveyed in 2003 by Black et al. reported that 
they undertook routine data linkage with other databases [37]. 
A review by Evans et al. reported that 68% of Australian clinical 
registries routinely undertook some form of data linkage to obtain 
outcome information, such as death or disease status, and to assess 
data quality [38]. The use of data linkage in research studies has 
increased almost sixfold within the last two decades. This proliferation 
of data linkage is reflected in the establishment of data linkage 
research centers and initiatives in Australia [39], Canada [40, 41], 
and the UK [42, 43]. 

Merging administrative data with other data sources can 
efficiently enrich the overall database. There are various ways in 
which extracts of administrative data can be linked with other data 
sources to create more comprehensive and effective datasets for 
analysis. It is not always easy to combine an administrative source 
with another source of information. This is especially true when a 
common matching key for both sources is not available and record 
linkage techniques must be used. In this case, the type of linkage 
methodology (e.g., definitive (deterministic) matching or probabilistic 
matching) is selected in accordance with the objectives of the research program. 
Administrative data can be successfully linked with a variety of 
other data sources, for example: 

• Linking individual-level administrative data with other individual-level 
administrative data via a unique identifier or probabilistic matching 
methods (matching personal details like names, date of birth, gender, 
address, etc.). 

• Linking individual-level administrative data with cross-sectional or 
longitudinal survey data, usually via matching methods. 

• Linking individual-level administrative data with contextual information 
on, for example, the neighborhood (postal code based socioeconomic 
classification in Canada) or organization relevant to the 
individual (e.g., hospital or primary care clinic attended). 
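
The two matching approaches listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the field names, the agreement and disagreement weights, and the acceptance threshold are all hypothetical, whereas production probabilistic linkage typically estimates its weights from the data and adds blocking and clerical review of borderline pairs.

```python
# Hypothetical (agreement, disagreement) weights per field; real systems
# estimate these from the data rather than fixing them by hand.
WEIGHTS = {
    "surname": (4.0, -2.0),
    "birth_date": (5.0, -3.0),
    "sex": (1.0, -1.0),
    "postal_code": (3.0, -1.0),
}


def normalize(value):
    """Upper-case and strip all whitespace so 'T2N 4N1' matches 't2n4n1'."""
    return "".join(str(value).split()).upper()


def deterministic_link(records_a, records_b, key="health_number"):
    """Exact match on a unique identifier shared by both files."""
    index = {r[key]: r for r in records_b if r.get(key)}
    return [(a, index[a[key]]) for a in records_a if a.get(key) in index]


def match_score(a, b):
    """Sum field weights; missing values contribute no evidence either way."""
    score = 0.0
    for field, (agree, disagree) in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue
        score += agree if normalize(va) == normalize(vb) else disagree
    return score


def probabilistic_link(records_a, records_b, threshold=8.0):
    """Link each record in A to its best-scoring candidate in B, if good enough."""
    links = []
    for a in records_a:
        best = max(records_b, key=lambda b: match_score(a, b), default=None)
        if best is not None and match_score(a, best) >= threshold:
            links.append((a, best))
    return links


# Fabricated records, for illustration only
registry = [
    {"health_number": "100", "surname": "Smith", "birth_date": "1950-03-02",
     "sex": "F", "postal_code": "T2N 4N1"},
    {"health_number": "200", "surname": "Jones", "birth_date": "1962-11-20",
     "sex": "M", "postal_code": "T5K 2J5"},
]
claims = [{"health_number": "100", "cost": 120.0}]
survey = [{"health_number": None, "surname": "SMITH ", "birth_date": "1950-03-02",
           "sex": "f", "postal_code": "t2n4n1"}]

exact_links = deterministic_link(claims, registry)
fuzzy_links = probabilistic_link(survey, registry)
print(len(exact_links), len(fuzzy_links))
```

The survey record has no health number, so it can only be linked through its personal details; the claims record links directly through the identifier.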


6 Example of a Population-Based Linkage of Health Records in Alberta, Canada: 
Development of a Health Services Research Administrative Database 

An example of the broad application of administrative databases to 
health-care planning for patients with chronic diseases is the 
administrative database developed for the Interdisciplinary Chronic 
Disease Collaboration (ICDC, www.ICDC.ca) [23]. This database 
was formed by linking multiple data sources that were originally 
intended for administrative use: the Alberta Kidney Disease Network 
(AKDN, www.AKDN.info) [41] and Alberta Health (AH) data 
sources (Fig. 1), which together include data on >3 million Albertans. 
The available data allow risk evaluation, case identification, and 
assessment of the rate and pattern of disease progression, complication rates, and 
associated costs, all of which can be used to guide policy direction. 
The ICDC repository was developed by linking laboratory data to 
administrative and other computerized data sources to allow assessment 
of socio-demographic characteristics, clinical variables, and 
health outcomes. 

A unique provincial health number, provided to all 
residents of Alberta by the provincial government, is used to link 
Alberta Government data with the pan-province laboratory database 
and a number of other data sources, including the provincial 
drug program databases [23, 41]. Alberta Health (AH), the 
provincial health ministry, provides basic health insurance to all 
residents of the province through a universally available health-care 
plan, and the insured residents are included in the AH database. 
This database allows the estimation of the prevalence of chronic 
disease conditions, continuous assessment of health-care utilization, 
and monitoring of the adequacy of current care through examination 
of quality indicators, service deliverables, health outcomes, 
and cost data. 

[Figure 1: diagram of the linked data sources] 

Alberta Health (administrative data) 

• Population Registry: date of birth, gender, First Nation status, postal 
code, socioeconomic status by fiscal year. 

• Physician Claims: date and location of service, diagnosis code 
(ICD-9-CM), provider specialty, costs. 

• Ambulatory Care: date, nature, and location of service, diagnostic and 
procedure codes (ICD-9-CM and ICD-10-CA), costs. 

• Inpatient Encounters: admission and discharge dates, diagnostic and 
procedure codes (ICD-9-CM and ICD-10-CA), Case Mix Group, costs. 

• Alberta Blue Cross: formulary drugs, date of dispensing, quantity 
dispensed, costs. 

• Alberta Vital Statistics: date of death, cause of death. 

Alberta Kidney Disease Network (AKDN) database 

• Laboratory Data: results of laboratory tests (serum creatinine, A1c, 
hemoglobin, potassium, lipid profile, urine protein). 

• Unique identifier allowing linkage with other data sources. 

Fig. 1 Example of a Computerized Database. Alberta Health (AH) maintains administrative data for Alberta. The 
Alberta Kidney Disease Network (AKDN) has developed a process for retrieval, storage, and maintenance of 
laboratory data and relevant laboratory tests for all patients who have these measurements across the province 
of Alberta. A data repository is created by linkage between AH administrative data and AKDN lab data and 
has been used for assessment of outcomes, including health services utilization and mortality for patients with 
laboratory tests measured. Adapted from Hemmelgarn BR et al. [41] 
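
Because the population registry covers all insured residents, it can serve as the denominator for prevalence estimates once cases have been identified in the linked data. The following Python sketch illustrates that idea only; the field names, grouping variable, and records are hypothetical and do not reflect the actual structure of the AH or AKDN files.

```python
from collections import Counter


def prevalence_by_group(registry, case_ids, group_field):
    """Proportion of registered residents who are cases, within each stratum.

    registry: list of dicts, one per insured resident (the denominator).
    case_ids: set of health numbers identified as cases in the linked data.
    """
    denominators = Counter(person[group_field] for person in registry)
    numerators = Counter(person[group_field] for person in registry
                         if person["health_number"] in case_ids)
    return {group: numerators[group] / n for group, n in denominators.items()}


# Fabricated registry and case list, for illustration only
registry = [
    {"health_number": "1", "region": "urban"},
    {"health_number": "2", "region": "urban"},
    {"health_number": "3", "region": "urban"},
    {"health_number": "4", "region": "rural"},
    {"health_number": "5", "region": "rural"},
]
case_ids = {"1", "4"}
prevalence = prevalence_by_group(registry, case_ids, "region")
print(prevalence)
```

The same pattern applies to any stratifying variable held in the registry, such as socioeconomic status by fiscal year.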
The ICDC databases [13, 26] include basic demographic data 
for residents, providing information on the burden of disease 
and care disparities across racial and ethnic groups, such as 
First Nations (Aboriginal) and Asian (Chinese and South Asian) 
ethnicity, and by socioeconomic status [44-46]. The database 
also contains residents’ postal codes, which enables 
geographic information system analyses of access-to-care 
issues (e.g., travel distance or time as a factor in health-care 
access/utilization) [47, 48]. The drug data available from prescription 
claims enable studies of medication utilization, medication-related 
costs, and clinical outcomes [41, 49]. Table 2 shows 


Table 2 
Examples of research studies done using ICDC administrative database to address key health issues 

Objectives | Authors | Examples of studies conducted using the administrative database 
Estimating disease burden | Hemmelgarn et al. [52] | Rates of treated and untreated kidney failure in older vs younger adults 
Estimating disease burden | Turin et al. [51] | Lifetime risk of ESRD 
Estimating disease burden | Turin et al. [50] | Chronic kidney disease and life expectancy 
Identification of risk and disease/risk stratification | Hemmelgarn et al. [54] | Relation between kidney function, proteinuria, and adverse outcomes 
Identification of risk and disease/risk stratification | Shurraw et al. [55] | Association between glycemic control and adverse outcomes in people with diabetes mellitus and chronic kidney disease: a population-based cohort study 
Identification of risk and disease/risk stratification | Turin et al. [56] | Proteinuria and rate of change in kidney function in a community-based population 
Identification of risk and disease/risk stratification | Alexander et al. [57] | Kidney stones and kidney function loss: a cohort study 
Identification of risk and disease/risk stratification | Tonelli et al. [58] | Using proteinuria and estimated glomerular filtration rate to classify risk in patients with chronic kidney disease: a cohort study 
Case definition and validation | Ronksley et al. [14] | Validating a case definition for chronic kidney disease using administrative data 
Case definition and validation | Clement et al. [20] | Validation of a case definition to define chronic dialysis using outpatient administrative data 
Socioeconomic status, First Nations status, ethnicity as risk factor | Chou et al. [59] | Quality of care among Aboriginal hemodialysis patients 
Socioeconomic status, First Nations status, ethnicity as risk factor | Conley et al. [44] | Association between GFR, proteinuria, and adverse outcomes among white, Chinese, and South Asian individuals in Canada 
Socioeconomic status, First Nations status, ethnicity as risk factor | Samuel et al. [45] | Association between First Nations ethnicity and progression to kidney failure by presence and severity of albuminuria 
Geographic location as risk factor | Faruque et al. [48] | Spatial analysis to locate new clinics for diabetic kidney patients in the underserved communities in Alberta 
Geographic location as risk factor | Ayyalasomayajula et al. [47] | A novel technique to optimize facility locations of new nephrology services for remote areas 
Geographic location as risk factor | Tonelli et al. [60] | Association between proximity to the attending nephrologist and mortality among patients receiving hemodialysis 
Quantification of utilization of physician encounters, hospitalization risk and complications | Ronksley et al. [61] | Patterns of engagement with the health care system and risk of subsequent hospitalization amongst patients with diabetes 
Quantification of utilization of physician encounters, hospitalization risk and complications | Rucker et al. [9] | Quality of care and mortality are worse in chronic kidney disease patients living in remote areas 
Quantification of utilization of physician encounters, hospitalization risk and complications | James et al. [62] | CKD and risk of hospitalization and death with pneumonia 
Health-care costs | McBrien et al. [53] | Health care costs in people with diabetes and their association with glycemic control and kidney function 


Manns et al. [63] 

Population based screening for chronic 
kidney disease: cost effectiveness study 


Wiebe et al. [10] 

Adding Specialized Clinics for Remote- 
Dwellers with Chronic Kidney Disease: 

A Cost-Utility Analysis 

Resource utilization 
in health care 

Hemmelgarn 
et al. [64] 

Nephrology visits and health care resource 
use before and after reporting estimated 
glomerular filtration rate 


Manns et al. [65] 

Enrolment in primary care networks: 
impact on outcomes and processes of 
care for patients with diabetes. 

Knowledge translation 

Hemmelgarn 
et al. [23] 

The research to health policy cycle: a tool 
for better management of chronic 
non-communicable diseases. 


(continued) 
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Objectives 

Authors 

Examples of studies conducted using 
the administrative database 


Hemmelgarn 
et al. [66] 

Knowledge translation for nephrologists: 
strategies for improving the identification 
of patients with proteinuria. 

Outcome research 

Turin et al. [67] 

One-year change in kidney function is 
associated with an increased mortality 
risk. 


Turin et al. [68] 

Short-term change in kidney function 
and risk of end-stage renal disease. 


Turin et al. [69] 

Change in the estimated glomerular 
filtration rate over time and risk of 
all-cause mortality 


Hemmelgarn 
et al. [54] 

Relation between kidney function, 
proteinuria, and adverse outcomes. 


specific examples of studies done using the ICDC administrative data 
to address key issues in health services research. Data from hos¬ 
pitalizations, health-care expenditures, emergency room records, 
ambulatory care information, and adverse outcomes are captured 
for analysis in combination with laboratory data for clinical, 
population health, as well as policy-relevant research [50-54]. 
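The travel-distance analyses made possible by postal codes can be illustrated with a small sketch. It assumes that each postal code has already been mapped to a latitude and longitude (a step not described in this chapter) and uses straight-line great-circle distance as a rough stand-in for travel distance. The function names and the distance cutoffs are hypothetical and chosen only for illustration; the cited studies may have used road-network distance or travel time instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_clinic_km(residence, clinics):
    """Distance from a residence (lat, lon) to the closest clinic (lat, lon)."""
    return min(haversine_km(*residence, *clinic) for clinic in clinics)


def distance_band(km, cutoffs=(50.0, 150.0, 300.0)):
    """Index of the distance band; the cutoffs are illustrative assumptions."""
    for i, cutoff in enumerate(cutoffs):
        if km <= cutoff:
            return i
    return len(cutoffs)


if __name__ == "__main__":
    # Hypothetical coordinates, not taken from any real dataset.
    home = (52.0, -113.0)
    clinics = [(51.0, -114.0), (53.5, -113.5)]
    km = nearest_clinic_km(home, clinics)
    print(round(km, 1), distance_band(km))
```

Straight-line distance understates real travel burden where roads are indirect, so a sketch like this is best read as a first approximation of the exposure variable.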


7 Pros and Cons of Using Administrative Data 

Table 3 summarizes the advantages and disadvantages of using
administrative databases for research. Administrative databases
have some advantages over data acquired from primary surveys or
primary data collection studies [2, 4, 12, 14, 21]. Generally,
administrative data capture a wider population than is possible
in primary studies. Administrative databases can also support
relatively longer follow-up of the study population. Additionally,
administrative data are often less costly to obtain than data from
purpose-designed studies or surveys [3, 4]. There are also
limitations of administrative database usage that need to be considered
for research purposes. First, administrative data are usually not
obtained for research purposes [5, 22]; thus, they may be
compromised in terms of data quality as well as generalizability of the
observed estimates. Second, administrative data are limited to
records obtained for the purposes of reimbursement, such as physician
claims data or drug benefit repayment data, or for tracking/monitoring
health-care service delivery. Detailed clinical information such as
blood pressure, lifestyle-related factors such as smoking, drinking,
exercise, or dietary information, and patient-centered
factors such as patients’ satisfaction may not be available.
Third, administrative database use for research purposes has been
criticized due to the lack of validation for certain characteristics,
such as diagnosis of chronic obstructive lung disease and mental
health or other clinical outcomes. Researchers around the world who
focus on administrative database usage are working on the development
of validated algorithms for case or exposure definitions.
Fourth, researchers’ lack of control over population
selection may result in variable follow-up and potentially
non-generalizable study populations when compared with well-designed
population-based cohort studies.

Table 3
Advantages and disadvantages of administrative data

Advantages of administrative data

• Already collected for operational purposes, so there are no additional costs for data collection, although there are costs for data-management activities (e.g., extraction, mining, cleaning).
• Data collection process is nonintrusive to the catchment population.
• These data are generally updated at regular intervals. In some scenarios the update is a continuous process.
• This type of data can provide historical information and allow consistent time-series to be established.
• Collected in a consistent way if part of systematic collection.
• Usually subject to rigorous quality checks for the data collection process.
• Generally a large number of individuals are covered.
• Individuals who may not respond to surveys can be captured.
• Potential for datasets to be linked to various other data sources to develop extensive research resources.

Disadvantages of administrative data

• As these data are primarily collected for administrative purposes, their use is largely limited to research on services and administrative questions. For clinical research, this data source has limited usage.
• There is a lack of researcher control over the contents of the data.
• Proxy indicators sometimes have to be used.
• Any changes to administrative processes could change definitions, and this can make comparison over time difficult.
• Quality issues with variables less important to the data vendor (e.g., address details may not be updated).
• Data privacy and protection issues are a matter of concern for access to this type of database.
• Access for researchers is dependent on the support of the administrative authorities that are the data custodians.
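The validation of case definitions mentioned in this section is usually summarized with standard accuracy measures. The sketch below assumes a hypothetical validation sample in which each person has two flags: whether the administrative algorithm identified them as a case, and whether a reference standard (for example, chart review) confirmed the condition. The function name and the data are illustrative only and do not reproduce any of the cited validation studies.

```python
def validation_metrics(pairs):
    """Accuracy of an administrative case definition against a reference.

    pairs: iterable of (flagged_by_admin_data, confirmed_by_reference) booleans.
    Returns sensitivity, specificity, PPV, and NPV (None if undefined).
    """
    tp = fp = fn = tn = 0
    for flagged, confirmed in pairs:
        if flagged and confirmed:
            tp += 1
        elif flagged and not confirmed:
            fp += 1
        elif not flagged and confirmed:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # true cases that were flagged
        "specificity": ratio(tn, tn + fp),  # non-cases that were not flagged
        "ppv": ratio(tp, tp + fp),          # flagged people who are true cases
        "npv": ratio(tn, tn + fn),          # unflagged people who are non-cases
    }


if __name__ == "__main__":
    # Hypothetical sample: 8 true positives, 2 false negatives,
    # 1 false positive, 9 true negatives.
    sample = ([(True, True)] * 8 + [(False, True)] * 2
              + [(True, False)] * 1 + [(False, False)] * 9)
    print(validation_metrics(sample))
```

Predictive values depend on how common the condition is in the validation sample, so they may not transfer to populations with a different prevalence.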

Although administrative databases are primarily intended for 
health administration and funding, they play an important role in 
health services research including program management, oversight 
and policymaking, examining population health and overall disease 
burden, and quality of care. Recognizing that they may lack some 
detailed clinical information, administrative data have the potential 
to provide a relatively cost-effective, less intrusive, and comprehensive 
resource for research.


Evidence-Based Decision-Making 7: Knowledge Translation

Braden J. Manns 

Abstract 

There is a significant gap between what is known and what is implemented by key stakeholders in practice
(the evidence to practice gap). The primary purpose of knowledge translation is to address this gap, bridging
evidence to clinical practice. The knowledge to action cycle is one framework for knowledge translation
that integrates policy-makers throughout the research cycle. The knowledge to action cycle begins with
the identification of a problem (usually a gap in care provision). After identification of the problem,
knowledge creation is undertaken, depicted at the center of the cycle as a funnel. Knowledge inquiry is at the
wide end of the funnel, and moving down the funnel, the primary data is synthesized into knowledge
products in the form of educational materials, guidelines, decision aids, or clinical pathways. The remaining
components of the knowledge to action cycle refer to the action of applying the knowledge that has been
created. This includes adapting knowledge to local context, assessing barriers to knowledge use, selecting,
tailoring, and implementing interventions, monitoring knowledge use, evaluating outcomes, and sustaining
knowledge use. Each of these steps is connected by bidirectional arrows and ideally involves healthcare
decision-makers and key stakeholders at each transition.

Key words Knowledge translation, Evidence to practice gap 


1 What Is Knowledge Translation? 

Given the volume of medical information published on a daily basis,
physicians and other healthcare providers are not able to keep abreast
of the generated evidence, including important studies that could
potentially impact clinicians’ day-to-day practice. This has generated
a gap between what is known and what is implemented by key
stakeholders in practice (the evidence to practice gap). This gap is
relevant to all those who struggle to keep up with evidence: patients,
providers, healthcare planners, and funders. The primary purpose
of knowledge translation is to address this gap, bridging evidence
to practice.

There is much confusion surrounding the various terms used
to describe the knowledge translation process, in addition to the
theories of practice change, and strategies for implementation [1].
Although it is beyond the scope of this review, others have
documented the various terms that have been used to describe this
process including knowledge transfer, knowledge exchange,
implementation science, knowledge dissemination, among others [2].
The term knowledge translation, which has been formally defined
by the Canadian Institutes of Health Research as a “dynamic and
iterative process that includes synthesis, dissemination, exchange
and ethically sound application of knowledge to improve the health
of Canadians, provide more effective health services and products,
and strengthen the health care system” [3], is now agreed upon by
several Canadian and US funders and institutions and thus will be
used throughout this chapter.


2 Frameworks for Knowledge Translation 

The framework for knowledge translation that has gained popularity
recently is the knowledge to action cycle. This framework integrates
policy-makers throughout the research cycle, which enhances the
likelihood that interventions developed will be feasible and scalable
for health system uptake, if they are demonstrated to offer value for
money. A Cochrane systematic review demonstrated that
interventions tailored to prospectively identified barriers are more
likely to improve professional practice than no intervention or
dissemination of guidelines [4]. This evidence, together with the
framework’s intuitive appeal and its adoption by Canadian funding
agencies, has led to extensive use of the knowledge to action cycle
over the past 10 years by researchers and healthcare organizations.

Other theories and frameworks relating to behavior change have
been discussed for achieving knowledge translation. Some of these
frameworks are more applicable to behavior change in large
organizations. For instance, the Institute for Healthcare Improvement
Collaborative model, which emphasizes change through shared
learning and knowledge by experts and senior healthcare leaders, is
often used when considering large-scale change within a healthcare
organization; further details are available elsewhere [5].


3 Overview of the Knowledge to Action Cycle 

An overview of the knowledge to action process is presented in
Fig. 1. Graham and colleagues illustrate how the processes of
knowledge creation and action intersect, with the goal of bridging the
evidence to care gap [2].

Fig. 1 The knowledge to action cycle [2]

The cycle usually begins at the bottom of the diagram, with
the identification of a problem: either a gap in care provision or
new evidence suggesting that current practice is inadequate.
Moving in a clockwise fashion, the remaining
components of the knowledge to action (KTA) cycle refer to the
action of applying the knowledge. This includes adapting knowledge
to local context, assessing barriers to knowledge use, selecting,
tailoring, and implementing interventions, monitoring knowledge use,
evaluating outcomes, and sustaining knowledge use. Each of these
steps is connected by bidirectional arrows and would ideally involve
healthcare decision-makers and key stakeholders at each transition.
Stakeholder engagement increases the likelihood that new
information will be incorporated into local practice [1].

Knowledge creation is depicted at the center of the KTA cycle,
as a funnel. Knowledge inquiry is at the wide end of the funnel and
addresses the multitude of primary studies that inform the
problem in question. At this stage the data has yet to be organized
into a useful format to inform action. Moving down the funnel,
the primary data is synthesized, typically through a systematic
review or meta-analysis, where the results of relevant studies are
combined or considered together. Ideally, knowledge synthesis
results in the creation of knowledge products in the form of
educational materials, guidelines, decision aids, or clinical pathways.
These product tools are clearer and more concise than a full
systematic review and can be used to inform stakeholders as they move
through the knowledge to action cycle.
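To make the idea of combining study results concrete, the sketch below pools effect estimates with fixed-effect inverse-variance weighting. This is one common approach and is an assumption here; the chapter does not specify a pooling method, and a random-effects model is often preferred when studies differ. The function name and the numbers are hypothetical.

```python
import math


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance pooled estimate, its standard error, and a 95% CI.

    estimates: per-study effect estimates on an additive scale
               (e.g., mean differences or log odds ratios).
    std_errors: the matching standard errors (all must be > 0).
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in std_errors]  # precise studies weigh more
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


if __name__ == "__main__":
    # Hypothetical study results, for illustration only.
    pooled, se, ci = fixed_effect_pool([1.0, 3.0], [1.0, 0.5])
    print(pooled, se, ci)
```

The second study has half the standard error of the first, so it receives four times the weight and pulls the pooled estimate toward its own value.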

Throughout the rest of this chapter, we will use the example of
timing of dialysis initiation to illustrate the knowledge to action
cycle. Given that the process of evidence generation, including the
conduct of randomized trials and the optimal method for
systematic reviews and meta-analyses, has been covered elsewhere in this
textbook, some sections of the knowledge to action cycle will be
presented more succinctly. More attention will be given to the
different types of interventions that have been used to influence
stakeholder behavior.


4 Timing of Dialysis Initiation: An Example 

Many factors are considered when deciding when to start dialysis
in outpatients with progressive kidney failure, including laboratory
markers of kidney function and subjectively reported symptoms
related to kidney failure that develop over time, such as fatigue
and nausea. While these symptoms are common in patients with
severe kidney failure, they are often difficult to interpret because
they can be due to other chronic health conditions common in
patients with kidney disease. As such, there is no hard and fast rule,
and deciding when a patient should start dialysis has been an
ongoing controversy for decades. Past clinical guidelines have generally
recommended initiation of dialysis when kidney function falls
below approximately “10%” [6].

These guidelines, and the difficulty of attributing patient
symptoms to kidney failure, may account for the increase
in “earlier” (i.e., at a higher level of kidney function) initiation
of dialysis in Canada and the United States over the past 10
years. For instance, the proportion of individuals starting dialysis
at an estimated glomerular filtration rate (eGFR) > 10 mL/min/1.73 m²
increased from 19% in 1996 to 45% in 2005 [7]. When patients are
started on dialysis early and unnecessarily, this not only negatively
affects patients’ quality of life but also strains the healthcare
system. Given this, improving the timing of dialysis initiation in
outpatients with progressive kidney failure was recently selected as
a priority for knowledge translation by a Canadian national kidney
knowledge translation network [8].
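The size of the shift in practice quoted above can be expressed in both absolute and relative terms. The short sketch below uses the two proportions from the text (19% and 45%); the helper names and the list of eGFR values are hypothetical and serve only to show how such a proportion would be computed from patient-level data.

```python
def proportion_above(egfr_values, threshold=10.0):
    """Share of patients whose eGFR at dialysis start exceeds the threshold."""
    if not egfr_values:
        raise ValueError("no eGFR values supplied")
    return sum(v > threshold for v in egfr_values) / len(egfr_values)


def change_in_proportion(p_start, p_end):
    """Absolute difference and ratio between two proportions."""
    return {
        "absolute_change": p_end - p_start,  # in proportion units
        "relative_change": p_end / p_start,  # fold change
    }


if __name__ == "__main__":
    # Proportions reported in the text for 1996 and 2005.
    change = change_in_proportion(0.19, 0.45)
    print(f"{change['absolute_change'] * 100:.0f} percentage points, "
          f"{change['relative_change']:.2f}-fold")

    # Hypothetical eGFR values at dialysis start (mL/min/1.73 m^2).
    print(proportion_above([8.0, 12.0, 9.0, 11.0]))
```

Reporting the absolute change (26 percentage points) alongside the ratio avoids overstating or understating the shift when the starting proportion is small.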


5 Steps in KTA Cycle

5.1 Knowledge Creation: Knowledge Inquiry, Synthesis, and Knowledge Products

As previously discussed and illustrated in Fig. 2, the center of the
KTA cycle represents the process of knowledge creation. The wide
end of the funnel, knowledge inquiry, encompasses the primary
studies (including randomized trials) that inform the problem in
question. In order to generate a useable knowledge product, the
data needs to be distilled into an organized format that can inform
subsequent action.

Fig. 2 The knowledge to action cycle for improving timing of dialysis initiation

The process of organizing studies into a usable knowledge
product usually involves a systematic review or meta-analysis
(see Chapter 24), where the results of pertinent studies are
combined or considered together to inform action. Some questions,
particularly those relating to changes in health policy or complex
interventions, are not suited to conventional systematic reviews.
In such cases, other approaches are needed; readers are referred
to other sources [1].

Although systematic reviews may combine the results of
several studies, making it much easier to consider the full breadth
of evidence, the results typically are not in a format that is
particularly usable by some groups of stakeholders, including patients
and healthcare decision-makers. The process of knowledge
synthesis, however, ultimately results in the creation of knowledge
products (e.g., educational materials, guidelines, decision aids, or
clinical pathways), which are clearer and more concise than a full
systematic review itself, and can inform stakeholders as they move
through the knowledge to action cycle.

Knowledge products may take several different forms. Patient
decision aids, which include paper-based booklets, internet-based
tools, and videos, are patient-friendly tools meant to educate
patients about their treatment options.
Ideally, they are based on high-quality evidence that is
summarized in a format patients can understand. Some patient
decision aids are meant to be used by patients independently and others
to be used alongside a healthcare practitioner. Patient
decision aids often present the risks and benefits of different testing
or treatment approaches. Decision aids have been shown to
influence behavior [1], though the impact on other types of health
outcomes is less certain.

Clinical pathways are tools for health professionals that are
derived from practice guidelines to aid in providing evidence-based
health care [9, 10]. While similar to guidelines, clinical pathways
differ by being more explicit about the sequence, timing, and
provision of interventions and are directly incorporated into routine
patient care. Pathways can help to improve the quality, consistency,
and continuity of care and ensure that evidence-based and
patient-focused care is being provided [11, 12]. Pathways have been
reported to be effective in supporting care management and
guiding clinical interventions and assessments [13-15].

5.2 The Knowledge to Action Cycle: Selecting Priorities for Knowledge Translation

In practice, the knowledge to action cycle usually begins when a
“problem” is identified. Depending on local circumstances, the
problem may be identified because new evidence has emerged
(for instance, from a large randomized trial) suggesting that the
current standard practice is inadequate.
In other situations, evidence to practice gaps will be identified
based on routinely measured and reported quality indicators, which
may lead stakeholders to have concerns about current health system
or provider performance. It should be noted that the strengths and
limitations of each quality indicator must be considered. For instance,
as quality indicators are often based on healthcare processes (i.e.,
the proportion of patients with a condition receiving a particular
medication or test), they are usually measured with routinely
available administrative data, which may have measurement issues, since
these databases were not developed for research purposes [1].
Quality indicators should be valid, reliable, and feasible to
measure, but importantly, they should be linked to important clinical
outcomes and should be based on best evidence.

Researchers may also identify gaps in care from administrative
health or other clinical registry data [16]. As above, it is important
to consider the validity and reliability of the data, to ensure that the
measure relates closely to the outcomes of care and that changes in
these measures will improve outcomes. The data should also be
representative of the population of interest.

While identifying an evidence to practice gap is critical to
justifying a knowledge translation exercise, other factors should be
considered before taking on such an exercise. For instance,
variation in care is common in health care, particularly when the
evidence in an area is weak. As such, beyond noting variations
in care or suboptimal performance on quality indicators, it is
generally recommended to focus knowledge translation (KT)
activities on areas where good-quality evidence exists to guide care.
When choosing among several candidate problems, it is also
important to prioritize areas where change is most feasible.

5.3 Adapting Knowledge to Local Context

Clinical practice guidelines or other knowledge tools may or may
not be available to guide care in an identified problem area, and
even if available, tailoring knowledge to local circumstances is
often required.

Up until recently, no validated process for adapting clinical
practice guidelines to local use existed. However, the ADAPTE
collaboration has established a process, including outlining the
necessary steps to ensure that guidelines can inform local practice [17].
Importantly, the ADAPTE process engages end users in the
guideline adaptation process to make certain that the end products will
best serve the stakeholders who will use them. The ADAPTE
process consists of three main phases: planning and set up, adaptation of
the guideline, and development of the final products. A web-based
resource toolkit is available (www.ADAPTE.org).

5.4 Assessing Barriers to Implementation of Evidence

Barriers to implementation of evidence can exist at several levels:
the patient; the healthcare professional, team, or organization; or
the wider environment.

At the patient level, barriers often prevent patients from
obtaining optimal outcomes. Patient barriers may involve both direct and
indirect costs: for instance, the recommended test or treatment may
be too expensive, or patients may have difficulty getting time off work
to attend an appointment. Patients may not understand the
importance of a given treatment, or the treatment may not be a priority
for them in comparison to other demands in their lives. Society
and societal norms can sometimes serve as a barrier for patients.
As individuals live within a larger society and lifestyles are generally
collective, lifestyle changes (e.g., exercise or changes to a healthy
diet) may be hard to implement at an individual level. Finally,
barriers to optimal care may relate to patients’ expectations of a
healthcare encounter. For example, patients may seek medical attention
for an upper respiratory tract infection (almost always viral),
expecting to receive an antibiotic. However, guidelines do not
recommend antibiotics in these situations since they would not be
expected to be effective and may contribute to antibiotic
resistance. Leaving the physician’s office empty-handed may make
patients feel like their provider isn’t taking their concern seriously,
and for physicians, it generally takes longer to explain to patients
why an antibiotic is not appropriate than to write a
prescription. If reducing antibiotic use is the goal of a health system,
addressing patient expectations would be important.

Barriers at the level of the healthcare professional may lie in
a lack of provider knowledge or uncertainty around how
best to manage conditions. With respect to knowledge, it may
simply be a lack of awareness of evidence or guidelines, issues with
information overload, or difficulties interpreting the quality of all
of the evidence that clinicians are required to consider. There may
also be a lack of clarity around how to implement a new
intervention, or a lack of self-confidence in the skills required to
adjust the doses of treatments, for instance new types of insulin regimens.
With respect to uncertainty, in many areas of medicine, only
low-quality evidence exists. These areas may not be the highest priority
for knowledge translation activities, since convincing healthcare
professionals to change practice in the absence of strong evidence
may be difficult and may be met with resistance, because key
opinion leaders may not agree with what is proposed as best
practice. In these areas, in particular, it may be difficult
for physicians to go against the usual standard of care they have
been practicing.

Barriers may also exist at the level of the healthcare team or
healthcare organization. A service, test, or treatment may simply
not be available or may be difficult for patients to access.
Reimbursement may not be available for some healthcare services.
There may not be enough time to spend with patients to provide
information or help them change their behaviors; this may be
particularly true for chronic disease management, which typically
requires the active involvement of other allied healthcare
professionals.

Finally, there may be barriers at the level of the practice
environment. For instance, it has been noted that one of the most
important barriers to routine handwashing, considered a high
priority within hospitals to reduce the spread of hospital-acquired
infections, is the availability of sinks. Specifically, the location and
number of sinks to permit handwashing between each patient visit
may be a significant barrier to proper hand hygiene.

While there are different approaches to determining what
barriers are most important within the identified problem area,
conducting a survey of the various stakeholder groups is usually required.
To inform the survey, an initial focus group may be helpful to
generate examples of the barriers that may be playing an important role,
followed by a formal survey to determine the most important and
modifiable barriers. Patients, healthcare providers, and healthcare
administrators should be included as stakeholders when assessing
the most important barriers. Of particular interest are barriers that
are modifiable, since these may be targets of the KT intervention.

5.5 Selecting, Tailoring, and Implementing Interventions

When considering what interventions might be effective at
overcoming the evidence to practice gap, consideration of the most
important modifiable barriers is critical. If the most important
barrier is at the level of the healthcare organization, then patient
and provider education would not be expected to be effective.
Moreover, when more than one barrier exists, combinations of
interventions may be more effective, although there is no consensus on
whether interventions should generally be used on their own or in
combination. In general though, the type of intervention selected
depends on the type of barrier that needs to be addressed. This
section is organized based on the type of barrier that was noted.

5.5.1 Physician Knowledge

When healthcare provider knowledge is identified as a barrier, then
provider education is usually required. This could take different
forms, including distributing educational materials to healthcare
professionals. Studies indicate that this form of education has
mixed effects on physician behavior. For instance, guideline
implementation strategies have been associated with a median
improvement in care processes of around 8% [18]. In general, continuing
medical education, when delivered through a large conference or
didactic teaching, has been shown to have minimal effect.
Alternatively, providing education in a small group or interactive
format has been shown to have a positive effect on practice and
possibly clinical outcomes, particularly when there is a reinforcing
activity that occurs following the education. Educational outreach
by experts and local opinion leaders also has been shown to be
particularly effective when the noted gap relates to appropriate
physician prescribing [18].


5.5.2 Information Overload, Lack of Awareness of New Evidence, and Lack of Clarity on How to Implement Intervention

Within this subset of knowledge-related barriers, clinical decision
support systems have been tested. These may be used in inpatient
and outpatient settings, using clinical pathways within electronic
medical records or laboratory data sets (for instance,
anticoagulation protocols, or antibiotic dosing protocols linked to laboratory
data). Clinical decision support systems are most likely to be
effective when they are provided: (1) as part of regular clinician
workflow, i.e., providing care recommendations within a patient’s chart
or electronic medical record so that clinicians do not need to seek
out recommendations elsewhere; (2) at the time and location of
decision-making, i.e., care recommendations provided as chart
reminders during a patient encounter, rather than as monthly
reports listing all the patients in need of services; or (3) as
recommendations rather than a general assessment of the evidence [19].
Clinical decision support systems incorporating all of these
elements are likely to be particularly effective.

5.5.3 Strategies Aimed at Other Barriers

Other knowledge implementation strategies that have been used
to address a variety of barriers include audit and feedback strategies
as well as simple reminders. With audit and feedback strategies, the
performance of a physician, group, or organization with respect to
a quality indicator is measured, and feedback is provided at the
most relevant level, often to the individual provider. This is usually
combined with recommendations around how to improve practice
since audit and feedback has been shown to be most effective when
combined with reminders and education.

Simple reminders, whether given verbally to patients or staff (e.g., 
during the course of regular care) or displayed as posters, have been 
shown to have the largest effects of any of the strategies used on 
their own [1]. However, there is large variation across studies, which 
may relate to their use in situations where reminders will not address 
the most important barrier. 

Substitution of tasks has also been shown to be an effective 
strategy for improving care. For instance, delegating tasks to nurses 
and pharmacists has been shown in some situations to improve use 
of guideline-recommended treatments, including prescribing and 
cancer screening [18]. Such delegation may be particularly effective 
in situations where clinical pathways have been developed to guide 
care. For instance, nurse- and/or pharmacist-led anemia protocols 
have gained popularity in managing anemia in patients with kidney 
failure requiring erythropoietin since they have been shown to 
achieve similar or better outcomes with respect to hemoglobin and 
erythropoietin doses [20, 21]. 

Finally, patient-mediated interventions, usually reminders or 
various forms of education, can be effective, particularly for 
improving preventative care, including vaccinations [18]. Patient 
decision aids, including paper-based booklets, internet-based tools 
and videos, are often used to present the risks and benefits of 
different testing or treatment approaches. While they have been 
shown to influence patient behavior [1], the impact on other types 
of health outcomes is less certain. 

One of the most effective interventions involving patients as 
part of the healthcare team is facilitated relay, defined as clinical 
information transmitted by patients to clinicians by means other 
than the existing medical record [22]. The expectation is that 
clinicians then act on the information to change patient management. 
Examples of facilitated relay are patients providing treatment 
guidelines to providers or sharing the results of home blood pressure 
readings with providers during a clinic visit. 

5.6 Monitor Knowledge Use 

After an intervention is undertaken, it is important to assess 
whether provider and patient behavior is impacted and whether 
the information targeted within the intervention is being used. 
Assessing whether knowledge is being used is the first step to 
evaluating for a change in outcomes. If knowledge has not changed, 
then it is unlikely that the intervention will impact care and 
outcomes. A variety of frameworks can be used to assess knowledge 
use, including whether stakeholders are aware of the target 
information or whether it has changed behavior. Many tools are 
available for this purpose, and readers are directed elsewhere for 
further information [1]. 

5.7 Evaluate Outcomes 

After assessing knowledge use, it is important to determine whether 
evidence to practice gaps have been narrowed or healthcare system 
performance has been affected. Often, this can be done using the 
same data set that was used to identify the problem, for instance, 
by assessing whether quality indicators have changed or evidence to 
care gaps have been closed. In addition to assessing whether practice 
patterns and care have changed, evaluating whether clinical 
outcomes have improved is also important, though it may not be 
feasible in all situations. 

5.8 Sustaining Knowledge Use 

Knowledge translation interventions typically focus on changing 
patient, provider, or health system behavior at a point in time. 
However, ensuring that such behavior is maintained requires a 
different type of intervention. To date, few interventions have 
incorporated the notion of sustainability, in part because of the ongoing 
resource requirements of such an intervention [1]. In addition to 
the barriers and facilitators to consider before implementing an 
intervention, it is likely that there are different barriers and 
facilitators to sustaining an intervention and the related practice 
change over the long term, and these require consideration when 
determining how best to ensure sustainability. 
6 Bringing It All Together: Returning to the Example of Timing of Dialysis Initiation 


While dialysis is lifesaving for patients without any kidney function, 
starting dialysis in patients without symptoms is intrusive and has a 
negative impact on quality of life. As noted, the proportion of patients 
initiating dialysis “early” has increased over the past decade, and there 
is considerable variation in the proportion of patients initiating 
dialysis with an eGFR >10.5 ml/min (approximately 10 % kidney 
function) in Canada [23], ranging from 20 to 60 % across geographic 
regions. In 2010, the first randomized trial to clearly inform timing 
of dialysis initiation was published, suggesting no benefit from 
early-start dialysis [24]. This seminal publication provided needed evidence 
to inform care in this area. Given this evidence, the noted variability in 
practice, the impact of dialysis on patients’ daily lives, and the fact 
that the care of patients on hemodialysis costs at least $85,000 per 
year [25], leaders of Canadian Kidney Care Programs who were 
surveyed in 2010 by the Canadian Kidney Knowledge Translation 
and Generation Network (CANN-NET) identified this topic as the 
top priority area for a new clinical practice guideline and knowledge 
translation activity [8]. 

Noting that a current guideline on timing of dialysis initiation 
was not available, CANN-NET engaged relevant kidney stakeholders, 
including the Canadian Society of Nephrology Clinical Practice 
Guidelines group, who established a guideline committee to revisit 
the cumulative evidence by conducting a systematic review addressing 
the optimal timing of the initiation of dialysis [26]. The review used 
the approach proposed by the Grading of Recommendations Assessment, 
Development and Evaluation (GRADE) Working Group and adhered to 
prespecified protocols [27]. Briefly, the committee developed search 
strategies to identify studies comparing early versus late (as defined 
in included studies) initiation of dialysis. Mortality, hospitalization, 
and quality of life were prespecified as critical outcomes and 
utilization-related measures (time on dialysis, hospitalization, distance 
traveled for dialysis, and outpatient visits) as important outcomes; 
nutritional surrogate markers were noted to be of interest, but not of 
major importance to decision-making. While 23 studies were identified, 
only one randomized trial, the Initiating Dialysis Early and 
Late (IDEAL) study, was found [24]. Timing of dialysis initiation 
did not affect survival in the IDEAL study, with a hazard ratio of 
1.04 (95 % confidence interval [CI] = 0.83-1.30; p = 0.75) [24]. 
The study also noted similar quality of life in both treatment arms, 
despite a median delay of nearly 6 months in the initiation of 
dialysis in the late start group. The delay in use of dialysis resulted 
in lower healthcare costs of nearly CAN$18,000 in the “intent to 
defer initiation of dialysis” group [28]. 
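
As a rough consistency check on the hazard ratio quoted above, the standard error of the log hazard ratio can be recovered from the confidence limits and used to recompute an approximate two-sided p-value. The sketch below is illustrative only: it assumes a Wald-type interval that is symmetric on the log scale, which the chapter does not state, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def p_value_from_hr_ci(hr, lower, upper, level=0.95):
    """Approximate two-sided p-value for H0: HR = 1 from a reported HR and CI.

    Assumption: the interval is a Wald interval, symmetric on the log scale.
    Returns (standard error of log HR, z statistic, p-value).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95 % interval
    se_log_hr = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se_log_hr
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se_log_hr, z, p


# Figures quoted in the text for the IDEAL study [24]
se, z, p = p_value_from_hr_ci(1.04, 0.83, 1.30)
print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, p = {p:.2f}")
```

With the figures quoted in the text, the recomputed p-value falls in the neighborhood of 0.7, broadly in line with the reported p = 0.75 once rounding of the published limits is taken into account.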

To assess barriers to optimal timing of dialysis initiation, a 
survey of nephrologists in Canada was conducted, which identified 
some underlying physician beliefs and possible misconceptions 
about timing of dialysis initiation [29]. Over 40 % of nephrologists 
felt that initiating dialysis at lower levels of eGFR would negatively 
impact quality of life and decrease use of peritoneal dialysis. Over 
half felt uremic symptoms (fatigue and nausea) occurred earlier in 
older patients or patients with more comorbidity, a view not 
supported by the IDEAL study [24]. Of importance, only 3 % of 
nephrologists worked at a facility or institution that had a formal 
policy regarding dialysis initiation. 

Based on the development of the clinical practice guidelines 
and the survey assessing barriers, several KT interventions are 
planned for implementation. The survey results on providers’ 
attitudes and the newly developed guidelines were presented in a 
special interactive forum at the annual Canadian Society of Nephrology 
meeting in 2013, attended by nearly a third of all Canadian 
nephrologists. An audit and feedback system was implemented 
in collaboration with the Canadian Organ Replacement 
Registry (CORR). CORR sends an annual facility-specific report 
to individual dialysis facilities, which includes a measure indicating 
the proportion of patients followed by a nephrologist who started 
dialysis with an eGFR >10.5 ml/min. This was 
presented alongside blinded information from other facilities in 
the immediate geographic region and national averages, followed 
up by communications from CANN-NET describing the relevance 
and importance of this quality indicator. Finally, centers were 
directed to the recently published guideline, if appropriate. 
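
The facility-level indicator described above can be sketched in a few lines. This is a hypothetical illustration, not CORR's actual reporting code: the `Facility` structure, the facility names, and the eGFR values are invented, the input is assumed to be already restricted to patients followed by a nephrologist, and the national figure is computed as a pooled proportion, which is only one way of forming a "national average."

```python
from dataclasses import dataclass

EARLY_START_EGFR = 10.5  # ml/min; threshold used for the quality indicator


@dataclass
class Facility:
    name: str
    region: str
    egfr_at_start: list  # eGFR (ml/min) at dialysis initiation, one value per patient


def early_start_proportion(egfr_values):
    """Proportion of patients who started dialysis with eGFR above the threshold."""
    if not egfr_values:
        return None
    early = sum(1 for value in egfr_values if value > EARLY_START_EGFR)
    return early / len(egfr_values)


def feedback_report(target, facilities):
    """Build one facility's report: own result, blinded regional peers, national figure."""
    peers = [f for f in facilities if f.region == target.region and f.name != target.name]
    all_values = [v for f in facilities for v in f.egfr_at_start]
    return {
        "facility": target.name,
        "own_proportion": early_start_proportion(target.egfr_at_start),
        # Peers are labelled generically so that other facilities stay blinded.
        "regional_peers": {
            f"Facility {chr(ord('A') + i)}": early_start_proportion(f.egfr_at_start)
            for i, f in enumerate(peers)
        },
        "national_proportion": early_start_proportion(all_values),
    }


# Hypothetical data for three facilities in two regions
facilities = [
    Facility("North", "West", [8.0, 11.2, 12.5, 9.1]),
    Facility("South", "West", [7.5, 9.9, 10.0, 13.0]),
    Facility("Harbour", "East", [11.0, 12.0]),
]
report = feedback_report(facilities[0], facilities)
print(report)
```

Reporting the facility's own proportion next to anonymized peers and a national figure mirrors the comparison described in the text; a real report would also need rules for small facilities, where a proportion based on a handful of patients is unstable.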

Since the survey identified knowledge-related barriers for 
providers, and assuming that knowledge barriers were also relevant for 
patients, we also created several unique educational materials 
targeting both patients and healthcare providers. Infographics, pictorial 
summaries highlighting the key messages for the intended audience, 
were created to illustrate the key guideline recommendations and 
emphasize the central role of patient preference and acceptance 
about when to initiate dialysis therapy 
(http://www.cann-net.ca/images/Patient-facing_infographic_fro_PRINT_steps_to_planning_for_dialysis-_Mar_17_2014.pdf). 
These dialysis initiation infographics will be placed in common 
patient care areas such as clinics, nursing stations, and physician 
offices. A whiteboard animated video was also developed discussing 
the state of the evidence, the importance of patient preference and 
choice, and the role of healthcare providers 
(e.g., http://youtu.be/mi34xCfmLhw). Finally, an 
academic detailing visit is planned involving educational outreach 
by a core group of physicians using standardized educational 
materials. 

The evaluation component will include a pre- and post-intervention 
time series analysis using available national 
administrative datasets to look for changes in the proportion of 
patients initiating dialysis “early.” A cluster randomized controlled 
trial is also planned to evaluate the impact of an expert site visit 
with standardized educational materials versus passive strategies 
alone. The outcome of interest will be the proportion of patients 
initiating dialysis “early.” Since it is possible that this strategy may 
result in excessive delays in dialysis initiation, as a safety outcome, 
the proportion of patients initiating dialysis as an inpatient will also 
be tracked. 
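
One common way to carry out a pre- and post-intervention time series analysis is segmented regression, in which separate linear trends are fitted before and after the intervention and compared at the intervention point. The chapter does not specify the method to be used, so the sketch below is an assumption-laden illustration: the quarterly proportions are invented, and the model ignores autocorrelation, seasonality, and uncertainty estimates that a real analysis would need.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def segmented_change(series, start):
    """Estimate level and slope change at index `start`.

    `start` is the index of the first post-intervention period. The level
    change compares the two fitted lines at that period.
    """
    pre_x = list(range(start))
    post_x = list(range(start, len(series)))
    a0, b0 = fit_line(pre_x, series[:start])
    a1, b1 = fit_line(post_x, series[start:])
    level_change = (a1 + b1 * start) - (a0 + b0 * start)
    return level_change, b1 - b0


# Hypothetical quarterly proportions of patients starting dialysis "early":
# four quarters before the intervention, four quarters after.
proportions = [0.40, 0.41, 0.42, 0.43, 0.38, 0.37, 0.36, 0.35]
level_change, slope_change = segmented_change(proportions, start=4)
print(f"level change = {level_change:+.3f}, slope change = {slope_change:+.3f}")
```

In this invented series the proportion was rising before the intervention and falls afterward, so both the level change and the slope change come out negative. The same structure could be applied to the safety outcome (the proportion of patients initiating dialysis as an inpatient).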

Utilizing the KTA cycle, considerable practice variation regarding 
the timing of dialysis initiation in Canada was noted. Synthesis of 
the existing evidence demonstrated no benefit to early dialysis 
initiation and supported the adoption of an “intent-to-defer” 
strategy. Although the majority of nephrologists followed the 
existing evidence, key barriers were identified and thus a national 
knowledge translation strategy was developed to improve care. 
An evaluation of this strategy is underway. 


7 Summary 


Knowledge translation is meant to address the evidence to practice 
gap. This can be facilitated by use of the knowledge to action cycle, 
integrating policy-makers throughout the research cycle. The 
knowledge to action cycle begins with the identification of a problem 
(usually identified as a gap in care provision). After identification of 
the problem, knowledge creation is undertaken, initially involving 
knowledge inquiry, followed by synthesis of data into key knowledge 
products in the form of educational materials, guidelines, decision 
aids, or clinical pathways. The remaining components of the KTA 
cycle refer to the action of applying the knowledge that has been 
created. This includes adapting knowledge to local context, assessing 
barriers to knowledge use, selecting, tailoring, and implementing 
interventions, monitoring knowledge use, evaluating outcomes, and 
sustaining knowledge use. Involving healthcare decision-makers 
and key stakeholders at each step increases the likelihood that new 
information will be incorporated into local practice. 


References 

1. Straus S, Tetroe J, Graham I (2010) Knowledge 
translation in health care: moving from evidence 
to practice. Wiley-Blackwell, West Sussex 

2. Graham ID, Logan J, Harrison MB, Straus SE, 
Tetroe J, Caswell W, Robinson N (2006) Lost in 
knowledge translation: time for a map? J Contin 
Educ Health Prof 26:13-24 

3. Canadian Institutes of Health Research (2014) 
More about knowledge translation. 
http://cihr-irsc.gc.ca/e/39033.html#Definition 


4. Baker R, Camosso-Stefinovic J, Gillies C, Shaw 
EJ, Cheater F, Flottorp S, Robertson N (2010) 
Tailored interventions to overcome identified 
barriers to change: effects on professional practice 
and health care outcomes. Cochrane 
Database Syst Rev 3, CD005470 

5. Institute for Healthcare Improvement (2003) The breakthrough 
series: IHI’s collaborative model for achieving 
breakthrough improvement. 
http://www.ihi.org/knowledge/Pages/IHIWhitePapers/TheBreakthroughSeriesIHIsCollaborativeModelforAchievingBreakthroughImprovement.aspx. 
Accessed 5 Aug 2012 

6. National Kidney Foundation (2002) K/DOQI 
clinical practice guidelines for chronic kidney 
disease: evaluation, classification and stratification. 
Am J Kidney Dis 39:S1-S266 

7. Rosansky S, Clark WF, Eggers P, Glassock R 
(2009) Initiation of dialysis at higher GFRs: is 
the apparent rising tide of early dialysis harmful 
or helpful? Kidney Int 76:257-261 

8. Manns B, Barrett B, Evan M, Garg A, 
Hemmelgarn B, Kappel J, Klarenbach S, 
Madore F, Parfrey P, Samuel S, Soroka SD, Suri 
R, Tonelli M, Wald R, Walsh M, Zappitelli M, 
NeTwork FtCKKTaG (2014) Establishing a 
national knowledge translation and generation 
network in kidney disease: the CAnadian 
KidNey KNowledge TraNslation and 
GEneration NeTwork. Can J Kidney Health Dis 
1:2. doi:10.1186/2054-3581-1181-1182 

9. Kinsman L, Rotter T, James E, Snow P, Willis J 
(2010) What is a clinical pathway? Development 
of a definition to inform the debate. BMC Med 
8:31 

10. Rotter T, Kinsman L, James E, Machotta A, 
Gothe H, Willis J, Snow P, Kugler J (2010) 
Clinical pathways: effects on professional 
practice, patient outcomes, length of stay and 
hospital costs. Cochrane Database Syst Rev. 
CD006632 

11. Whittle C, Hewison A (2007) Integrated care 
pathways: pathways to change in health care? 
J Health Organ Manag 21:297-306 

12. Scott S, Grimshaw J, Klassen T, Nettel-Aguirre 
A, Johnson D (2011) Understanding 
implementation processes of clinical pathways 
and clinical practice guidelines in pediatric 
contexts: a study protocol. Implement Sci 
6:133 

13. Allen D, Gillen E, Rixson L (2009) Systematic 
review of the effectiveness of integrated care 
pathways: what works, for whom, in which 
circumstances? Int J Evid Based Healthc 7:61-74 

14. Sulch D, Perez I, Melbourn A, Kalra L (2008) 
Evaluation of an integrated care pathway for 
stroke unit rehabilitation. Age Ageing 29:87 

15. Cunningham S, Logan C, Lockerbie L, Dunn 
M, McMurray A, Prescott R (2008) Effect of 
an integrated care pathway on acute asthma/ 
wheeze in children attending hospital: cluster 
randomized trial. J Pediatr 152:315-320 

16. Manns B, Braun T, Edwards A, Grimshaw J, 
Hemmelgarn B, Husereau D, Ivers N, Johnson 
J, Long S, McBrien KA, Naugler C, Sargious P, 
Straus S, Tonelli M, Tricco A, Yu C, For the 
Alberta Innovates HSICDC (2013) Identifying 
strategies to improve diabetes care in Alberta, 
Canada, using the knowledge-to-action cycle. 
CMAJ Open 1(4):E142-E150 

17. Fervers B, Burgers JS, Haugh MC, Latreille J, 
Mlika-Cabanne N, Paquet L, Coulombe M, 
Poirier M, Burnand B (2006) Adaptation of 
clinical guidelines: a review of methods and 
experiences. Int J Health Care 18:167-176 

18. Grol R, Grimshaw J (2003) From best evidence 
to best practice: effective implementation of 
change in patients’ care. Lancet 362(9391): 
1225-1230 

19. Kawamoto K, Houlihan CA, Balas EA, Lobach 
DF (2005) Improving clinical practice using 
clinical decision support systems: a systematic 
review of trials to identify features critical to 
success. BMJ 330(7494):765-768E 

20. Brimble KS, Rabbat CG, McKenna P, Lambert 
K, Carlisle EJ (2003) Protocolized anemia 
management with erythropoietin in hemodialysis 
patients: a randomized controlled trial. 
J Am Soc Nephrol 14(10):2654-2661 

21. To LL, Stoner CP, Stolley SN, Buenviaje JD, 
Ziegler TW (2001) Effectiveness of a 
pharmacist-implemented anemia management 
protocol in an outpatient hemodialysis unit. 
Am J Health Syst Pharm 58(21):2061-2065 

22. Tricco AC, Ivers NM, Grimshaw JM, Moher 
D, Turner L, Galipeau J, Halperin I, Vachon B, 
Ramsay T, Manns B, Tonelli M, Shojania K 
(2012) Effectiveness of quality improvement 
strategies on the management of diabetes: a 
systematic review and meta-analysis. Lancet 
379(9833):2252-2261 

23. Clark WF, Na YB, Rosansky SJ, Sontrop JM, 
Macnab JJ, Glassock RJ, Eggers PW, Jackson 
K, Moist L (2011) Association between estimated 
glomerular filtration rate at initiation of 
dialysis and mortality. Can Med Assoc J 183:47-53 

24. Cooper BA, Branley P, Bulfone L, Collins JF, 
Craig JC, Fraenkel MB, Harris A, Johnson 
DW, Kesselhut J, Li JJ, Luxton G, Pilmore A, 
Tiller DJ, Harris DC, Pollock CA, Study I 
(2010) A randomized, controlled trial of early 
versus late initiation of dialysis. N Engl J Med 
363:609-619 

25. Lee H, Manns B, Taub K, Ghali WA, Dean S, 
Johnson D, Donaldson C (2002) Cost analysis 
of ongoing care of patients with end-stage renal 
disease: the impact of dialysis modality and dialysis 
access. Am J Kidney Dis 40(3):611-622 

26. Nesrallah GE, Mustafa RA, Clark WF, Bass A, 
Barnieh L, Hemmelgarn BR, Klarenbach S, 
Quinn RR, Hiremath S, Ravani P, Sood MM, 
Moist LM (2012) Canadian Society of 
Nephrology 2012 Clinical Practice Guidelines 
for timing the initiation of chronic dialysis. Can 
Med Assoc J 186:112-117 




27. Grading Recommendations Assessment 
DaEGWG (2012) GRADE Working Group, 
2012. http://www.gradeworkinggroup.org/ 

28. Harris A, Cooper BA, Li JJ, Bulfone L, Branley 
P, Collins JF, Craig JC, Fraenkel MB, Johnson 
DW, Kesselhut J, Luxton G, Pilmore A, 
Rosevear M, Tiller DJ, Pollock CA, Harris DC 
(2011) Cost-effectiveness of initiating dialysis 
early: a randomized controlled trial. Am J Kidney 
Dis 57(5):707-715 

29. Mann B, Manns B, Dart A, Kappel J, Molzahn 
A, Naimark D, Nessim S, Soroka SD, Zappitelli 
M, Sood M, (CANN-NET) ObotCKKTaGN 
(2014) An assessment of dialysis provider’s 
attitudes towards timing of dialysis initiation in 
Canada. Can J Kidney Health Dis 1(3) 


Chapter 30 


Evidence-Based Decision-Making 8: Health Policy, 
a Primer for Researchers 

Victor Maddalena 

Abstract 

There is a growing expectation that research will be used to inform decision-making. It is important for 
researchers to understand how health policy is developed and the different ways they can influence the 
development of policy. 

Public policy is developed to resolve identified problems. Health policy is a subset of public policy and 
is typically concerned with issues related to the health of populations either from a service delivery perspective 
or from a broader public health and social determinants of health perspective. The policy planning 
algorithm is well established and follows the basic decision-making framework: assessment, planning, 
implementation, and evaluation. A variety of government and nongovernment stakeholders engage in 
complex debates to identify and resolve policy issues. In this chapter we explore how researchers can use 
their research to influence the development of health policy. Knowledge translation strategies focused on 
communicating research to policy-makers require considerable thought and planning. 

Key words Health policy, Policy planning algorithm, Decision-making, Knowledge translation 


1 Introduction 


This chapter will explore health policy and, more specifically, how 
epidemiological research can inform the development of health 
policy. While knowledge for knowledge’s sake is laudable, there is 
a growing expectation that research will answer important questions 
or address issues facing the healthcare system, the health of 
populations, or society in general. It is therefore useful to understand 
how the researcher and their research can influence the 
development of policy. 

In this regard I will examine how health policy is developed, 
the social and political context within which policy is developed, 
and the ways researchers can be involved in, and influence, the 
policy process. I will use examples from Canada’s health system to 
illustrate the policy-making process. In particular I will focus on 
the communication of research results to policy-makers and present 
some strategies to engage the decision-makers of government. My 
goal is to present a layperson’s guide to policy development as 
opposed to a theoretical exposition of the intricacies of policy-making. 
Therefore, my focus and concern will be on practical 
considerations. 

Patrick S. Parfrey and Brendan J. Barrett (eds.), Clinical Epidemiology: Practice and Methods, Methods in Molecular Biology, 
vol. 1281, DOI 10.1007/978-1-4939-2428-8_30, © Springer Science+Business Media New York 2015 


2 What Is Health Policy? 

There are various definitions of public policy in the literature. 
A commonly cited definition is that of Leslie Pal, who states that 
public policy is “...a course of action chosen by public 
authorities to address a given problem or interrelated set of 
problems” [1] (p. 2). Others, for example Lydia Miljan, define policy 
as “...a conscious choice by governments that lead to deliberate 
action—the passage of law, the spending of money, an official 
speech or gesture or some other observable act—or inaction” [2] 
(p. 3). Silence or nonaction on a particular issue can also be a 
statement of policy [3]. Deliberate action in the form of policy 
directives can take the form of the allocation of resources, the 
enactment of laws or regulations, a publicly stated position, 
taxation, and so forth. Public policy is developed within 
an established framework or process and is generally consistent 
with social values. Policy can range from the legislation that 
enables Regional Health Authorities to deliver and monitor a 
wide range of health services to snow clearing of city streets and 
sidewalks. While the clearing of snow from streets after a snowstorm 
may seem far removed from the broad public policy 
domain, the process of snow removal from city streets is part of a 
set of broader level policy initiatives related to public safety, 
healthy cities, and pedestrian-friendly initiatives. 

Public policy is often subdivided into governmental or industrial 
sectors: fisheries policy, economic development policy, agricultural 
policy, national security policy, resource management 
policy, environmental policy, social policy, among others. Health 
policy is merely a subset of public policy and is typically concerned 
with issues related to the health of the population either from a 
service delivery perspective or from a broader public health 
perspective, for example, the treatment or prevention of disease or 
the promotion of health. These sectors of policy development are 
not separate and discrete domains and there is often considerable 
overlap among various policy sectors. Depending on the author 
and their world view, health policy can be as concerned with 
economic development, housing policy, public transit policy, and 
social policy as it is with addressing the broader 
determinants of health (education, employment, social networks, 
environment, etc.) and service delivery issues in the healthcare 
system [4]. 







Within the policy realm there are different levels of policy-making, 
each with its own process for clarifying purpose and 
content and protocols for consultation and approval. Macro-level 
policy-making at the municipal, provincial, national, and 
international levels generally takes the form of legislation, lawmaking, 
and establishing regulations. Policies at this level affect larger 
populations and have broad social implications and legal means of 
enforcement. The arena for public policy-making at the macro-level 
occurs within the public domain, in particular municipal 
councils, provincial legislatures or parliaments, and their supporting 
infrastructure. 


3 Purpose of Public Policy 

The general purpose of public policy is to formalize initiatives 
established by governments and to achieve the overall mission of 
governments. Policies guide action, establish priorities, and provide 
the means by which government directives are implemented 
and monitored. For example, one of the most significant policy 
statements a government can make is the approval of the annual 
budget. The budget is a clear statement of priorities, allocating 
scarce resources to various departments and initiatives to serve the 
public good. Therefore, public policy should reflect the general 
needs and values of the population it is serving [2, 3, 5]. 

The word “policy” conjures up visions of binders on shelves 
containing policies in public institutions. Here we need to distinguish 
macro-level health policy from the kinds of institution-specific 
policy found, for example, in hospitals and other 
public institutions. Their general purpose is the same but there are 
some differences. Public institutions have a plethora of 
policies (and accompanying procedures) to govern and direct staff 
on everything from hiring of staff, procurement of goods or services, 
and financial management to policies on approvals related to 
preparing reports on occupational health and safety, human 
resource policy, protocols for the administration of intravenous 
drugs, record keeping, and a wide range of other specialized activities 
that require standardized application. These policies set by 
administration play an important role in providing consistency and 
integration of activities within an institution. Institutional policy is 
developed and approved within the organization and may be 
unique to the organization or set by industry standards or shaped 
by legislation and regulatory requirements. The kind of policy we 
are concerned about in this chapter is macro-level governmental 
health policy that generally takes the form of legislation or regulations 
that direct and shape the delivery of health care or other 
social services. 

As an example of macro-level governmental policy-making, let 
us examine one of Canada’s most well-known public policies, the 
Canada Health Act [6] or Medicare as it is informally known. To 
understand the Canada Health Act, we need to go back in history 
to examine how federal and provincial powers were established as 
they relate to health care. In Canada the federal and provincial 
governments share responsibilities for health care. The Canadian 
Constitution Act (1867/1982), formerly known as the British 
North America Act (1867), outlines the structure and operation of 
the Government of Canada and powers held by the federal and 
provincial governments [7]. Sections of the Act specifically related 
to health care include Section 92(7), which states that Provincial 
governments are responsible for “The Establishment, Maintenance, 
and Management of Hospitals, Asylums, Charities, and 
Eleemosynary 1 Institutions in and for the Province, other than 
Marine Hospitals.” The somewhat out-of-date wording reflects 
the time period when the original act was written in 1867. Simply 
stated, the Act states that provincial governments can establish 
their own priorities for health care, manage their own budgets, and 
plan services to meet the needs of their population; in other 
words, establish policy. 

The Canadian Constitution Act also defines the responsibility 
of the Federal government as it pertains to health [7]. The Federal 
government pays some of the costs of health care (further detailed 
in the provisions of the Canada Health Act (CHA) [6] and the 
Canada Health Transfer 2 and various health accords that determine 
levels of funding) and it is responsible for the provision of 
health care to specific groups, including Aboriginal Canadians, the 
Royal Canadian Mounted Police (RCMP), prisoners in Federal 
penitentiaries, refugee claimants, and the Canadian Military [8]. 
The federal government sets the criteria and conditions that must 
be met by the provinces to access federal funding under the provisions 
of the CHA [9]. 

1 Eleemosynary means relying on charity. 

2 The Canada Health Transfer, or CHT, is the largest transfer of financial 
resources from the federal government to the provinces and territories. It 
provides long-term predictable funding for health care and is consistent with 
the principles of the Canada Health Act. Source: 
https://www.fin.gc.ca/fedprov/cht-eng.asp 

The Canada Health Act is one of the defining cornerstones of 
Canadian public policy [6]. The CHA embodies the Canadian values 
of equity and unity and outlines the principal objective of 
Canadian health policy, which is “...to protect, promote and 
restore the physical and mental well-being of residents of Canada 
and to facilitate reasonable access to health services without financial 
or other barriers” (CHA Sec. 3). The Act ensures “...that all 
eligible residents of Canada have reasonable access to medically 
necessary services on a prepaid basis, without charges directly 
related to the provision of insured health services” [9]. The CHA, 
as a statement of public policy, outlines the principles and structures 
that govern Canada’s health insurance system. Specifically, 
those principles are public administration, comprehensiveness, 
universality, portability, and accessibility. These principles enshrine 
in legislation the values inherent in Canada’s health system [9]. 
The Canada Health Act is one policy initiative that has served to 
define a nation and establish the priorities and values associated 
with the delivery of health care. All other health policy initiatives 
in Canada are either directly or indirectly influenced by this 
foundational policy. 

It is clear that public policy is a powerful tool that can shape 
the mission and goals of the public domain. Indeed it is because 
policy is a powerful way to shape direction that organizations and 
individuals seek to influence these directions. The process of policy 
development is very dynamic and takes place in a particular time 
and place and historical, social, and political context. 


4 The Policy Arena 


The “policy arena” is a commonly used metaphor for the social and 
political context within which policy issues are debated and 
developed. The image of a sports arena with various players or teams 
competing to “score” points is a reasonably good analogy. Open 
any daily newspaper or watch the evening news on television and 
you will see a wide range of public policy issues being debated, 
refuted, criticized, supported, or examined. On any given issue 
there is a wide range of stakeholders (individuals or groups) that 
have an interest in that issue. There are many players or “actors” in 
the policy arena. The old saying, “You can’t tell your players without 
a program” in the sports context applies equally well to the 
policy development process. 

For example, within Canadian provincial governments, there
is a host of players involved in the policy process, including
various levels of junior and senior policy analysts, senior
administrators, and of course politicians, including Ministers and their support
staff. The ruling party in government will have a Premier and a
Cabinet composed of Cabinet Ministers representing various
portfolios or departments. The Premier is supported by the Office of
the Executive Council and the Office of the Premier. The work of
the Cabinet is supported by the Cabinet Secretariat (Privy Council
Office at the Federal government level) and a wide range of
Standing Committees, senior advisors, and communications
personnel [3, 10, 11].

Outside of government there is a wide range of players or
actors that have an equally important role to play in the
development of public policy, including interest groups, lobby
groups, nongovernmental organizations, the media, and
individual citizens. While there are rare occasions when all
stakeholders are in agreement, it is usually the case that there are
many divergent opinions on the definition of the problem, the
potential solutions, and who ultimately will benefit from the
outcomes of a policy issue.

To illustrate the policy arena in action, it may be helpful to cite 
a specific example. Health human resources (HHR) have been and 
will likely continue to be a key policy priority for governments, 
health professional organizations, service delivery organizations, 
and educational institutions. The ultimate goal of health human 
resource policy is to ensure reasonable access to the right numbers 
and mix of health professionals in a reasonable period of time, in an 
appropriate setting, and at a reasonable cost. On the surface this
may seem like a fairly simple problem with an equally simple
solution. The challenge in this example, however, is that there are many
regulated health professional groups (physicians, nurses,
physiotherapists, occupational therapists, pharmacists, among others)
that have different opinions regarding the optimal number and
distribution of health professionals to meet the health needs of the
population or insured group.

Health professional organizations at the national and
provincial level, the agencies that regulate and license their practice,
organizations that represent their professional interests, the educational
institutions that train those professionals, and the governments
that fund their salaries and the services they provide engage in a
complex debate to ensure reasonable access to health care. In
addition, there are citizen-led public interest groups advocating access
to services that require health human resources. Add to this the
concerns and interests of those health organizations that utilize the
services of health professionals and it is clear the “policy arena” can
get crowded and the issues, complex.

In Canada health is a provincial responsibility and therefore
each province can make regulations affecting health professions. In
this regard the decisions made by one province can have a
significant impact on the policies of another province, and while there is
some degree of standardization across the provinces and
territories, there are subtle differences among the provinces. And then there
is the issue of health professions seeking to improve their own
position vis-à-vis other health professions (also known as professional
turf wars). All of these groups and stakeholders have an interest in
shaping the problem and identifying solutions. While the debate
seeks to ultimately serve the public interest, it is often difficult to
distinguish between arguments that purport to serve the public
interest and those that benefit professional interests. At the root
of the discussion is the basic problem of resource allocation:
unlimited wants, limited resources, and the need for governments to
make difficult choices.

Another policy example is the debate over the possible harmful effects of tanning
salons and the need to regulate access to tanning beds, particularly
for individuals under the age of 19. One side of the debate
argues that tanning beds are unhealthy because exposure to
harmful doses of ultraviolet radiation may lead to the development of
skin cancer, and that people under 19 years of age should
be banned from accessing tanning beds. Proponents of the ban cite
research that documents the harmful effects of the ultraviolet light
emitted by tanning beds. The business owners of tanning
salons and their patrons, however, take a different view of
the issue. They believe that tanning beds, used properly, are not
harmful. They believe the research is inconclusive, that other
lifestyle habits in society that are harmful are not banned (so why
focus on tanning beds?), that individuals who live in a free society
should have the right to access tanning beds, and so forth.

Agencies promoting cancer prevention sell their message to
the public, media, politicians, and bureaucrats using research,
briefings, meetings with senior officials, and press releases to
advocate for a ban on tanning beds for individuals under 19 years of
age. They also seek support and endorsement from health
professional groups who hold a similar view. There is strength in
numbers. The stronger the argument and the stronger the evidence,
the greater the likelihood the public and government regulators will
agree. The business owners, in turn, engage in their own social
marketing by issuing their side of the story, and they lay out their
arguments against the prohibition of tanning beds to individuals
under 19. In this policy debate the advocates for a ban on youth
tanning appear to have won the argument, and a ban has been
supported in many jurisdictions.

Public policy debates rarely take place in private. Special
interest groups and individual citizens can participate in the policy
process through a variety of means, including participating in
government-sponsored consultation processes (e.g., surveys,
opinion polls, focus groups), expert consultation processes, legislative
hearings, and submissions or presentations to Commissions or Special
Task Forces, among others. Perhaps the most common way citizens
influence the policy process is by casting a vote in
elections or a referendum. Citizens or interest groups can engage in
other forms of policy advocacy, including letter-writing,
community activism, town hall meetings, and preparing and submitting
briefs or position statements to government [3, 5, 12, 13].

The stakeholders (also known as actors) with an interest in an
issue seek to promote their views using various forms of
communication, including the popular media. In this regard
the media is recognized as a significant player in the public policy
process. Large and powerful
stakeholder groups recognize the importance of effective public
relations and communications to promote their views to the public
and to government. The media may or may not take an interest in
the issue as a possible news story. Both sides of a policy debate will
seek to present their perspective to the media to win public and
political support [5, 14].

In the middle of the debate, but certainly not a bystander, is
the bureaucracy of government. Public servants are usually the recipients of
all perspectives on a policy issue. They seek to understand the issues
and ensure their government officials and political figures are
appropriately briefed on all aspects of the problem, and it is their
job to examine and bring forth for consideration possible
remedies. Each side of the debate will lay out its research and
evidence to support its viewpoints. Sometimes this evidence is
conflicting and unclear. The media can be a powerful ally in a
policy debate. If an issue can generate significant public
interest or, even better, public outrage, then governments
will usually respond. Power and politics go hand in hand.
Well-organized and well-resourced interest groups and stakeholders
are at a distinct advantage in a policy debate when compared to
unorganized, vulnerable, or marginalized groups that have limited
resources.

In the midst of all of this debate, research can play an
important role in informing the policy debate on any given issue.
Researchers therefore can and should seek to have their research
heard.


5 A Policy Planning Algorithm 

At a very basic level health policy is concerned with
problem-solving [1]. Problems are encountered or identified (current or
anticipated) and in response government develops policy to address
the problem. The primary objective of policy development is to
comprehensively assess the policy issue or problem and to identify and
implement the most effective and cost-efficient solutions in a
manner that is consistent with social values and within existing policy
structures and processes. In this regard policy-making is very
similar to the kinds of problem-solving that occur in business or in
clinical settings. The difference is in scope of influence.

When a business identifies a problem and devises a solution, 
the impact is generally limited to that business. When a govern¬ 
ment identifies a problem and devises a solution, the impacts can 
be far-reaching and influence large populations. While there are a 
variety of frameworks and diagrams describing the complexities of 
policy-making, the basic decision-making algorithm is a circular 
process and includes assessment, planning, implementation, and 
evaluation. 

5.1 Assessment

Individuals, special interest groups, and political groups seek to
have their issue added to the policy agenda. Some groups are very
powerful and can easily get the attention of government and the 
media, for example, the provincial or national physician or nursing 
associations, or a range of national charitable organizations or 
groups representing the interests of the pharmaceutical industry. 
Other groups are less well recognized and their voices may not be 
as easily heard in the policy arena. 

Whenever a policy issue or problem is identified, the first and
paramount concern is the need to clearly define the problem, and
this includes understanding the issue, the context for the issue, and
the reasons why it is perceived as a problem. In the assessment
phase of policy development, the principal objective is to define
and delineate the problem. Key questions that need to be answered
include, for example: What is the nature of the problem? More
importantly, who has identified and defined the problem? What are
the sources of information to assist you in defining the problem? Is
this a new problem? If it is not a new problem, how have other
jurisdictions addressed the issue? If it is a new issue, what policy
options are available to address the problem? Are these options
consistent with prevailing social values? Are the options legal,
viable, cost-effective, and publicly acceptable? What is the cost
(financial or other) to address (or not address) the problem? What are
the longer term implications of the policy options? Are the options
being considered consistent with the political views of the
government in power? Will these policy options have any unintended
effects on particular groups? Who is most affected by this issue and
have they been consulted? And perhaps most important, will one
of these policy options actually solve the problem? The assessment
or problem-defining stage of policy development is a critical and
essential step in the process.

As the problem is being identified and debated, government is
receiving letters, presentations, and position papers from
individuals, businesses, and stakeholder groups, each stating their own views
on the subject and trying to influence the policy process by seeking
to define the problem (and identify solutions) from their own
perspective.

Governments seek to understand the various positions and the
research behind them, and try to determine (a) whether the
issue is worthy of government intervention and (b) if the
government did intervene, what options are available to formulate a “good
policy” response. Because the legislative agenda is so full, policy
does not make its way to the floor of a legislature unless it has been
identified as a priority.

5.2 Planning

Once the problem has been clearly articulated, all the issues have
been identified and discussed, and evidence presented, the process
of actually planning the policy response takes on a more serious tone.
Governments seek to implement policy that will resolve the
problem in the most cost-effective manner, with minimal negative
consequences, ultimately serving the greatest good for the greatest
number of people.

Stakeholder groups, experts, or researchers will often be invited
to participate in consultations and planning the policy response
because they are often in a position to more fully inform the
process and understand the implications and potential outcomes of
the initiative. The policy response is often shaped by actions
that have been implemented in other jurisdictions. Policy analysts
in government often develop a network of contacts in other
provincial governments in their area of expertise. When a problem
arises, they contact their network of colleagues in other
jurisdictions to ask, “Have you
encountered this problem?” and if so, “What did you do about it?” In
some cases the problem is unique to a particular situation or
context and a new solution must be generated. In these instances the
bureaucracy will engage experts inside and outside of government
to assist with generating policy options.

5.3 Implementation

As policy initiatives are narrowed down to viable options, a new set
of questions is asked, including: What level of policy intervention is
necessary? Does this problem require legislation, regulation, or a
lesser form of policy statement from the government? In some
instances government may determine that the best role it can play
is to act as a broker for industry or among groups to resolve the
problems without government intervention.

Once it is determined that the government will implement
legislation or regulation, the proposal has to go through a series of internal and
external vetting procedures. In the case of legislation and
regulations, the Minister will need to assess the impact of the policy
before implementation. Indeed, the responsible government agencies
and Cabinet will need to have a detailed process for
assessing the impact of a policy option in a variety of domains,
for example, economic impact, health impact, social inclusion,
costs, monitoring, evaluation, human resources, public relations,
impact on other departments in government, and national or
international implications.

I will not review the process for implementing legislation and
regulations in this chapter; there are several good resources
available to describe the detailed and lengthy process of generating
legislation and regulations [3, 15].

Suffice it to say that once the legislation, regulation, or policy
has been implemented, a process is put in place to monitor
the outcomes. Politicians and political parties play an important
role in policy development and in the implementation of policy.
Political parties adopt platforms that, if they are elected, will form the
outline for their policy agenda. When elected, the government in
power can create a legislative agenda based on its priorities and
create legislation. In the policy arena a large number of interest
groups lobby and put forth their ideas of what should be on the
policy agenda of governments. From this the government in power,
together with its bureaucracy, sifts through a wide range of
concerns identified in government and identified by the public through
stakeholder groups.

In some cases, even in the face of overwhelming scientific or
research evidence to support a particular policy intervention,
governments will decide not to intervene using legislation or
regulations. A good example occurred recently in Canada, when the
Federal government decided not to regulate the food industry in
terms of reducing dietary sodium in food products. While
the government acknowledged the impact of increased dietary sodium on the
health of the population, it decided not to take a firm policy
stance on the issue and instead to work with the provinces
and industry to resolve the problem. It states,

The federal, provincial and territorial governments are committed to 
helping create conditions that make the healthier choice the easier 
choice. Sodium reduction is an important part of healthy living and the 
governments have been working together towards supporting 
Canadians in their sodium reduction efforts. The goal is to work 
towards reducing the average sodium intake of Canadians to 2,300 mg 
per day by 2016. With this goal in mind, the government is: a) working 
to increase the awareness and education of Canadians on the issue of 
sodium as part of healthy eating; b) supporting research related to 
sodium reduction; c) providing guidance to assist the food industry in 
lowering the amount of sodium in processed foods. [16] 

In this way, the government works with industry to achieve the 
desired outcome, without the imposition of legislation or strict 
regulatory constraints. 

Ministers of Health (or other portfolios in government) are
regularly briefed on a wide range of issues. The legislative agenda
is usually very full and only the most pressing legislative concerns
rise to the top for action. Governments constantly prioritize the
policy issues and decide which among them are a high priority and
worthy of immediate attention. Other issues not on the policy
agenda are placed in the queue to be dealt with at a later time.

5.4 Evaluation

Due to increased public scrutiny and accountability demands being
placed on governments, there is a growing need to evaluate policy
initiatives. A wide range of public “watch dog” organizations,
nongovernmental organizations, the media, and academic or policy
think tanks, not to mention the Office of the Auditor General,
are also actively engaged in public policy evaluation. Again, the
objective is to determine whether the policy does what it was supposed to
do. Bobby Sui suggests policy-makers should ask the following
questions to determine if a policy intervention is “good” policy:

1. Are the interests of stakeholder groups well balanced?

2. Is an accountability framework well articulated? 

3. Are the objectives and expected impacts of the public policy 
explicitly stated? 

4. Is the public policy the most cost-effective way to resolve the 
problem? 

5. Is the public policy just to everyone affected by its
implementation?

6. Does the public policy balance short- and long-term
considerations? [3] (p. 84)

Good policy should reflect the values of society and be consistent
with the principles of justice and fairness. Evaluation is typically an
ongoing process from the time of implementation. Governments
typically establish a monitoring process (formal or informal) to
track implementation and assess whether the policy has actually
solved the problem.


6 Researchers and Policy 

The good news is that researchers can play an important role at all
stages of policy development. Increasingly governments tend not
to take on the role of conducting their own research. The research
capacity of many government departments has decreased over the
past 20 years. This is due in part to pressures to keep the size of
governments small and in part to the fact that the
range of issues facing the government at any one time is so large
that it is difficult to develop expertise in all of them.

For example, in Canada, the Cabinet Secretariat at the
provincial level or the Privy Council at the federal level and the general
public service provide support to the Premier (provincial), Prime
Minister (Federal), and Cabinet, and it is their job to ensure the
appropriate Minister(s) and Cabinet members understand the full
range of options for resolving problems, the impact of each option
(financial, political, or other), who is for and against each option,
and any risks or mitigating factors. Policy analysts and senior
bureaucrats prepare detailed briefing documents and
presentations to ensure their political masters fully understand the scope of
the problems and options available to them. The Cabinet Secretariat
or Privy Council plays an important role in coordinating the mass of
information that the government in power needs to understand to
facilitate the development of good policy decisions [17].

As a researcher you may be drawn into the policy process in
one of several ways. You may have done research that you feel can
inform policy and actively seek opportunities to share your
research with policy-makers to influence change. Alternatively, government
may approach you to seek your expertise and advice on a subject
area because you have done research. A third option is that your
research forms part of a body of literature that may inform policy
at a later time.

In progressive jurisdictions government and consortiums of
health researchers come together to formally collaborate to address
pressing policy issues. The Contextualized Health
Research Synthesis Program (CHRSP) of the Newfoundland and
Labrador Centre for Applied Health Research (NLCAHR) is a
good example of how researchers collaborate with government to
address policy issues. NLCAHR brings together decision-makers in
government and researchers to identify important policy questions
and to synthesize existing research and contextualize the findings to
Newfoundland and Labrador [18]. Specifically, CHRSP
brings together policy-makers and researchers to:

“...focus on specific issues, rather than broad research themes; identify
issues of concern to health system leaders; use research expertise to
formulate researchable questions; synthesize quality research literature
(systematic reviews); tailor the syntheses to the local context
(challenges, capacities); report research results quickly and in usable
formats.” [18]

There are various other examples of government-researcher
collaborations across Canada that bring together policy-makers
and researchers. In some cases governments provide funding to
these agencies to support their research projects.


7 Communicating Research Results to Policy-Makers 

The first thing you will notice about policy-makers in government
is that they work on a very different time scale; the second is that they are
concerned with a different set of priorities. Researchers, even
working on the fast track, take a considerable period of time from their
idea to the final stage of presenting and publishing their research.
Policy-makers, on the other hand, work with much shorter time
frames. When a problem is identified, policy-makers need the
research results to inform their policy decision yesterday, or in the
best case scenario, today. When researchers and policy-makers
work together, this is often the first point of concern: policy-makers need
information today, and it takes time to generate research to answer
their very pressing questions. Researchers view their research as a
process, whereas policy-makers tend to view research as a product that
can inform their policy-making [19].

The second feature, answering to a different set of priorities,
reflects the fact that the primary role of a public servant is to
serve the government in power, and through the government they
serve the public. Ultimately, policy-makers want to provide the
best advice to their senior officials, and they have to sift through
large amounts of information and navigate the concerns and issues
of a wide range of stakeholders and special interest groups.

Knowledge translation, the process of translating and
communicating research findings to end users (for example, industry or
decision-makers), is an essential component of the research process.
The Canadian Institutes of Health Research define knowledge
translation as,

...a dynamic and iterative process that includes synthesis,
dissemination, exchange and ethically-sound application of knowledge to
improve the health of Canadians, provide more effective health
services and products and strengthen the health care system. This process
takes place within a complex system of interactions between
researchers and knowledge users which may vary in intensity, complexity and
level of engagement depending on the nature of the research and the
findings as well as the needs of the particular knowledge user. [20]

Indeed, detailed descriptions of knowledge translation
processes are an essential component of many research grant
applications. It is also important to engage and involve policy-makers
early in the research endeavor and maintain open communication
regarding the progress. The earlier and more often policy-makers
are involved and the more they are engaged during the research,
the greater the likelihood the results of the research will be utilized
in the policy process [19].

We are often in the position of being asked to share the results 
of our research with decision-makers in health organizations or 
governments. As researchers we are concerned about all aspects of 
our research. The research process is just as important to us as the 
outcome. Our training has taught us to be attentive to ethical
considerations, research design, and ensuring our methods are
appropriately suited to answering our research questions. Rigor, validity,
reliability, and generalizability are words that permeate our research
conversations. When we present our research in academic settings 
or at conferences, we carefully describe our process: our research 
question, our methods for data collection, our methods of analysis 
and limitations. Findings and conclusions arising from our research 
are derived from the data and analysis and are generally limited to 
the scope of the project we have undertaken. As researchers we 
tend to follow a well-rehearsed script when presenting our research 
to colleagues and attention to detail is important. 

Policy-makers, however, have different concerns in mind when
they look to research to inform the policy process. They are less
concerned about the procedures of research than about
its outcomes. A policy-maker assumes
you have done your research according to academic standards and
that your data collection and analysis will withstand scrutiny by
your peers. They are more concerned with the findings,
conclusions, and perhaps more important, the implications of your
research as it relates to the issue of concern. This has important
implications if you are invited to present your research to
government officials. While many senior analysts in government have a
background in research, it is generally good practice to repackage
your academic presentation and tailor it to your audience. Spend
less time explaining the details of your analysis and spend
more time explaining what the research “means.”

In clinical practice there is an expectation that there is a body 
of research to warrant a change in clinical practice as opposed to a 
single study. Similarly in policy-making a single study may or may 
not serve as the basis for a change in policy. As a researcher, it is not
realistic to assume that your single project will be sufficient to
change policy. Rather, expect that governments will similarly be
seeking a body of research to support their decision-making and
justify a change in policy. Be aware, too, that other factors
will be considered in policy decision-making.

The well-known axioms of “research informed policy-making”
and “evidence-based decision-making” can quickly become a
different kind of exercise known in the field as “decision-based
evidence making.” In other words, policy-makers may have a particular
policy direction in mind and search the research literature to see if
there is research to support the policy direction governments are
interested in pursuing. This is not to cast a dim light on the
integrity of policy-makers, but rather to recognize that the policy
process can be very complex and many factors are considered when
contemplating a change in direction, especially if the policy issue is
contentious.


8 Conclusions and a Checklist 

Researchers can play an important role in the development of
health policy. Indeed there is a growing expectation that research
will inform the important decisions faced by health organizations
and governments. The health policy development process is
complex and involves many governmental and nongovernmental
stakeholders. Communicating research to government or NGO
policy-makers requires a shift in focus away from attention to the
details of the research to a more concise focus on the outcomes.
The following checklist may serve as a helpful guide to ensure your
research can inform policy decisions.

8.1 Checklist

1. Be aware and keep abreast of the current policy issues related
to your field of research.

2. As you conceptualize your research project, identify and build 
relationships with decision-makers that may have an interest in the 
outcomes of your research. Apprise them of the goals and aims of 
your research and keep them updated on your progress. 

3. Explore unique knowledge translation opportunities to share
the results of your research with potential end users.

4. Try to involve decision-makers in your research as
collaborators; if they choose not to be involved, keep them
apprised of your progress.

5. Seek opportunities to get involved in public consultation
sessions on policy issues related to your research.

6. If your research is relevant to a current policy debate, seek 
opportunities to talk about your research in public media. 
Radio stations and newspapers are always seeking interesting 
content on current topics. 

7. If invited to present your research to decision-makers, remember
a few pointers: understand your audience, stay on topic, keep
your messages simple, avoid jargon, state any conflicts of
interest up front, and focus on your results and
their policy implications rather than on your methods (no
matter how interesting). Finally, provide brief summaries of your
research in easy-to-understand language.
types of.317-320 

use by health professionals.327-329 

weaknesses.316, 322-323 

Economic issues.170 

EMBASE and MEDLINE, differences of.426 

Endophenotype-based approach, role.359 

End-point adjudication committee.173 

End-stage renal disease 

clinical trials.220 

cost-effectiveness analysis, 

EVOLVE trial.253-254 

grounded theory study.304 


knowledge translation, 

Equipments 

additional, required.277 

Errors 

accidental.6 

alpha error.9 

effect on study results.6 

measurement.11, 203, 339 

random.5, 6, 8,11, 32, 35, 71, 72, 335, 390 

systematic error.5-7, 31, 32, 34, 35, 

42-44, 72,121, 334, 335 

type I and type II, in clinical trial.167 

types of.6, 335 

European Regulatory Issues on Quality of Life Assessment 

Group.195 

European Union, HRQOL guidelines.194-195 

Evidence-based decision making 

administrative database utilization.469-481 

clinical practice guidelines.443-452 

critical appraisal.385-395 

health policy.501-516 

health technology assessment.417-438 

knowledge translation.485-498 

meta-analysis.397-412 

systematic review.397-412 

translational research.455-466 

Evidence-based research, key elements of.273-274 
Evidence tables, used in CPG.447 

Evidence to practice gap.485, 491, 493, 

495,498 

Exposed vs. unexposed subjects.55 

Extended generalized linear models 

choice of analytical tool.115 

correcting model variance.114-116 

fixed effects.112-114 

generalized estimating equations.115, 116 

intraclass correlation coefficient.113 

mixed effect models.114 

panel data layout.111-112 

random effects.113,114, 120 

variance components, of data.113 

Extended survival models 

correlation among events.117 

correlation in survival times.116 

event dependence within subject.116 

ordered events.119-120 

risk set, for survival analysis.118-120 

shared frailty.117, 121 

time-dependent effects.120 

time-varying covariate.120 

unordered events.118,123,126 

unshared frailty.116 

External validity.11-12, 57, 58,149, 

152,154, 345 


F 

Factorial designs.15,163,169,185, 242 

Final outcome.10,16, 340 

Follow-up 

outcome measures.166 

rate of loss.170-171 

time to complete.165 

Food and Drug Administration (FDA), U.S.166, 194, 

196,280 

Forest plot. 121, 404, 405, 407 

Foundations. See also Funding 

charitable/not-for-profit agencies. 275 

private and public, for research grants.275 

Frailty model.121-122 

generalized linear models.94-96 

Functions, concept of.72-73 

Funding 

agency research theme.273 

for clinical research.274-276 

costs.172 

distribution.282 

license required for.172 

reporting.335 

sources of.274-276 

Funnel plot.403, 405, 406 


G 

Gene identification strategies 

association analysis in.352 

linkage methods in.352 

Generalized linear models 

limitation.94 

members, attributes in.94 

regression coefficients.95 

standard link functions and inverses.95 

systematic component of.94 

validity.97 

Genetic association study.359-361 

interpretation of.361 

Genetic diseases, clinical epidemiologic studies 

ascertainment bias.335, 336 

bias in different designs.13, 343 

bias minimizing methods.399 

competing risks bias.335-337 

compliance bias.335, 342 

confounding, 

confounding minimizing strategies, 

contamination bias.335, 342-343 

diagnostic bias.335, 341 

family information bias.335, 342 

information bias.335, 339, 342, 345 

intervention bias.335, 342 

lead time bias.335, 339-340 

length time bias.335, 340, 341 

loss to follow up bias.335, 337-338, 344 

matching.345 

multivariate modeling.345, 346 

non response bias.335, 337 

overmatching bias.335, 338 

prevalence-incidence bias.335, 344 

proficiency bias.335, 342 

recall bias.335, 339, 344 

restriction.345 

selection bias. 335-338, 343, 344, 346 

stratification.345, 346, 360-362 

survivor treatment selection bias.335, 338 

volunteer bias.335, 337, 344 

Will Rogers phenomenon.335, 341 

Genetic epidemiology, in complex traits 

allele sharing method.353-356 

analysis of linkage studies.355-356 

association analysis.350, 352 

association-based study challenges.357-363 

candidate gene approach.358 

complex traits.350-351 

false positive associations.361 

genetic risk ratio.364 

genome-wide association study (GWAS).358-359, 

361-365 
genomic imprinting.352 

genotyping technology.358-361, 365 

limitations of linkage studies.356-357 

linkage methods.353-357 

Mendelian patterns of inheritance.350, 351 

mode of inheritance.351-352 

non Mendelian patterns of inheritance.352 

parametric linkage analysis.353-354 

phenotype.349-365 

replication studies.355, 356 

sample size.350, 356, 362-364 

segregation analysis.350, 351, 357 

twin studies.350 

Genetic equilibrium. See Hardy-Weinberg equilibrium 

Genetic heterogenicity.357 

Genetics ELSI (Ethical, Legal, and Social Issues) 

biobanks.374-378 

budget.379 

definition.369-371 

deliberative democracy approach.376 

GE³LS research.370-374 

genetic health literacy.376 

history.371-374 

interdisciplinary team.370, 372, 378, 379 

normative research.370, 372-374, 379 

publication process.379 

public engagement.371, 373-378 


Genetics, Ethical, Economic, Environmental, Legal and 
Social Issues (GE³LS). See Genetics ELSI 
(Ethical, Legal, and Social Issues) 


descriptive research.372 

integrated research.463 

stand-alone research.371, 372 

Genome Canada.274,275,279, 

371,372 

Genomewide association studies.350 

Genomic imprinting.352 

Good Clinical Practice (GCP).21 

membership guidelines.21 

Governance 

of ethics review.21 

policies.473 

reporting.278-279 

Grades of recommendation assessment, development and 

evaluation (GRADE). 388, 446, 447, 496 

Grading, clinical practice guideline.3, 250, 328, 

443-452, 461, 491, 496, 497 


Grants 

budget allocation.278 

clinical research.274 

review.275 

Grey literature.424, 425 

Grounded theory methodology.302-304 

Group sequential methods.181 


H 

Hard outcomes.9,10,209, 266, 445 

Hardy-Weinberg equilibrium.361 

Hazards proportionality.109 

Health care 

costs.315,318, 323-325,478 

decision makers.327, 432 

funding.274, 322, 327, 328, 463 

impact of the therapy on costs of.315,322, 323, 326 

and medical devices.417, 419 

policy.170, 473 

problems, 

public financing of.192 

technologies (see Health technology assessment) 

Health economics. See also Economic evaluation 

allocative efficiency.317-320 

cost-benefit study.318 

cost-effectiveness study.318 

cost-minimization analysis.318 

cost-utility study.317 

decision analysis.323, 324 

opportunity cost.316-317, 320 

quality adjusted life year (QALY).319-322, 

324-326 

randomized controlled trials.319 

technical efficiency.317, 318, 320 

value for money.318, 320, 323, 327, 328 

Health outcomes.17, 36,290, 301, 308, 317, 

324, 326, 333, 374, 417,418,420,428-430,445, 
450,458,461-464,475,476,490,495 
Health policy 

Canada Health Act.504, 505 

checklist.515-516 

communication with policy makers.513-515 

definition.502, 506 

evaluation.511-512 

evidence-based decision making.501-515 

example.501-507, 509, 510 

implementation.510-511 

knowledge translation.514 

planning.509-510 

policy arena.505-509, 511 

policy problem assessment 

public policy.502-507, 511, 512 

public policy purpose.503-505 

researcher role.512-513 

stakeholders.505-511, 514 

Health-related quality of life (HRQOL) 

criteria for evaluation.196 

domains for general measures.198 

guidance document, for measurement.196 

instruments used for.197 

SF-6D and versions.198 
Health status.26, 43, 177, 192, 193, 197, 

199, 262, 263, 267, 271, 306-308, 393, 470 
Health technology assessment 

basic framework for.419-433 

checklist.431, 432 

cost-benefit analysis in.430 

cost-effectiveness analysis in.430 

cost-utility analysis in.430-431 

databases.424-426 

data collection.428, 429, 435, 437 

decision making framework.434 

definition.418 

dissemination strategies.436 

economic analysis in.429-432 

evidence interpretation in.435 

evidence synthesis.435 

ethical issues.422-423 

examples.419, 423, 427 

findings.432-433, 435 

identifying topics.419-420 

monitoring.433 

problems.434 

recommendations.432 

search strategy.425-426 

social issues.421-422 

sources of evidence.423, 425 

specification.420 

Health Utilities Index.319 

Healthy years equivalent (HYE).268 

Heart outcomes prevention evaluation 

(HOPE).163,173 

Helsinki Declaration.20 

Hemodialysis, grounded theory study 

for. 17,20,129,170,253,255, 303-310, 

317,318,477,478,496 
Hereditary nonpolyposis colorectal cancer 

(HNPCC).303,357 

Heterogeneity test.410 

Hierarchy of evidence. 13-14, 54, 386, 387 

Historical cohort.53, 57, 212 

Hosmer-Lemeshow chi square.148 

HTA reports.433 

Hypothesis-generating analyses.189 

I 

Identical by descent (IBD).354 

Identity link function.95 

Implantable cardioverter-defibrillator.338 

Incidence rate ratio.80, 85, 89, 90,103,110 

Incompetence, and ethical research.28, 29 

Indirect association studies.358 

identification strategies.352-364 

Indirect costs.278, 492 

Informed consent.19,20,22, 24-29,160, 

200, 304, 375, 378 


Input and output 

possible relationships between.84 

variables measurement.55, 82, 83 

Intention-to-treat analysis.8, 167,180 

Interaction 

coefficient of.85, 87-88 

modeling confounding and.85-86 

in multiplicative models.88-90 

hypothetical cardiac events data as IRR.89, 90 

parameter as measure of.82, 87 

statistical meaning.86-87 

Interactive parameter estimation.251 

Interest phenomenon.6,202 

Interim analyses.169,170,180-183 

Intermediate (surrogate) response.10 

Intermediate variables.9 

Internal validity.11-12 

Intervention bias.335, 342 

Intervention questions, experimental design for.14-16 

Inverse probability of censoring weights.252, 257-258 

Item response theory.193, 204 

K 

Kaplan-Meier (K-M) analysis.213 

Kaplan-Meier method.63, 67,124,125 

Karnofsky performance status.192 

Knowledge to action cycle.459, 460, 486-491, 498 

Knowledge translation 

ADAPTE process.491 

barriers.486, 487, 492-493 

clinical decision support systems.494 

definition.149, 154, 457 

evaluation.460, 498 

evidence to practice gap.491, 493, 495 

example.478, 479, 488, 496, 514 

frameworks.486 

intervention implementation.486, 493, 494 

knowledge products.488-490 

knowledge synthesis.488, 490 

knowledge use sustainability.495 

local context.460, 491 

policy maker integration.464, 486 

problem identification.486 


risk prediction models, 


L 

Lag censoring.251, 257,258 

Latent trait theory.204 

Leber’s optic atrophy.352 

Legal issues 


in CPG development 
in research ethics 
Level of evidence 

in diagnostic studies.16 

meta-analysis of randomized controlled trials for.411 
of observational studies.447 

sample sizes and follow up for.184 

Level of significance.9, 356 

Licensing, for drugs/health technologies.172 

Likelihood 

of article being valid.320 

based estimation procedures.91, 215, 253 

of CKD.8 

as function of recombination fraction.354 

and probability.76 

ratio.210,293-294, 353,392 

treatment being successful.322 

Linear model. See also Generalized linear models 

regression coefficients of.96, 97 

three-dimensional representation.83 

Linear predictor.73, 78, 94,101,112,113 

Linear regression.66, 67, 74, 75, 78-80, 86, 

96-101,244, 346 

Linear relationship.74, 77, 79 

Linkage disequilibrium.358 

Linkage methods, for genetic identification.352, 353 

Literature review. See also Systematic review 

in clinical trials.4-5 

in development of CPG.451 

of HTA resources.435 

to identify related research.4 

Living with end-stage renal disease and hemodialysis 

(LESRD-H).306,307 

Logistic function.73, 95, 98-100 

Logistic model. See also Poisson model, for count, check 
for model fail 

Coefficient in logistic regression.79 

Dichotomous response.67 

vs. linear model 

Sigmoid nature and information of error.73 

Structure.98-99 

Logit function.95 

Longitudinal cohort studies.13, 14, 226, 

244-245, 374 

Longitudinal data.71,110, 111, 129 

Longitudinal studies. See also Cohort studies 

analyzing confounders.64 

confidence intervals. 65-67, 72, 91, 97, 

112,114,120,121,127,152,154 

confounder identification.64 

Cox Proportional Hazard Model.109 

diagnostic test assessment.14,17 

log-rank test.63 

multivariate models.68, 71-91 

odds ratio.63, 65, 66, 79,100,101,134 

power.62-64, 68, 81, 84, 97, 

106,115,119,166 

relative risk.63, 65, 66 

risk estimation.65-66 

sample size estimation.64 

survival data analysis.67 


M 

Markov models.129, 431 

Masking.11,32, 39,164, 403 

Matching 

case control studies.60, 61, 65,134-137, 345, 360 

cohort studies. 133,134,138,139 

overmatching.338 

to reduce confounding.134 

Maximum likelihood estimation (MLE).75, 76, 91, 

99,104,215. See also Likelihood 

Memorandum of understanding (MOU).283 

Memorial University of Newfoundland, research grant 

report.276 

Memorial University, research ethics board.462 

Mental component summary (MCS).198 

Meta-analysis 

bias risk.398, 399 

data abstraction.400, 401, 403-404 

evidence matrix.408 

flow diagram.404 

Forest plot.404, 405, 407 

Funnel plot.403, 405, 406 

heterogeneity test.410 

information summary.404-410 

information synthesis.404-410 

limitations.398, 410-413 

meta-regression plot.405-407 

protocol.400, 401, 403 

publication bias.401, 403, 405, 412 

research question.397, 400-402 

strengths.410-411 

study design.400, 402 

study identification.402 

Microsatellite markers, advantage of.355 

Ministry of Health and Long-Term Care, Ontario.433 

Medical Advisory Secretariat (MAS), 

Missing data.11,15,16, 54,129,164,205, 

261,263-267,270 

Modifiers. 11,137, 407, 409 

Mortality, clinical outcomes.183 

Multicollinearity.110 

Multiple measurements.93, 111 

Multivariable model (R² statistics), GFR cohort 

study.96 

Multivariate analysis 

additive models.85-88 

competing risk model.118 

confounding.72, 82-90 

Cox regression coefficients.96-97 

Cox regression model.127 

data transformation.81 

exposure-response relationship.72, 79-80 

extended generalized linear models.111-116 

extended survival models.116-128 

frailty models.121 

function. 72-77, 79, 80, 82, 84, 85, 87, 88, 90 


generalized linear models.94-96 

general linear model 

coefficients.96-97 

structure.96 

validity.97 

interaction.82-90 

interaction coefficient.87-88 

likelihood.76 

linear predictor.73, 78, 94,112,113 

logistic model structure.98-100 

logistic regression.80, 94, 95, 98-103,105 

logistic regression coefficients.96-97 

Markov chain models.129 

maximum likelihood estimation.75 

model assumption.77, 81 

model choice.78-81 

model random component.78 

model structure.77-78 

multivariate ANOVA.96 

ordinary least squares method.75 

parameter estimates. 74, 76, 79, 81, 91 

Poisson model structure.73, 89 

Poisson regression.80, 88, 91 

Poisson regression coefficients.80 

probability.76 

propensity score matching.138-139 

random effects model.111 

regression.71-72 

repeated measures ANOVA.128,129 

reporting methods.90-91 

reporting results.91 

risk prediction models.146 

survival analysis methods.105-106 

time-dependent effects.120 

time series models.129 

time-to-event data functions.94 

time varying covariates.120 

unordered events marginal model.118-119 

variance-corrected models.111, 117, 120, 121 

Multivariate ANOVA (MANOVA).129 

N 


Narrative reviews and systematic review, 

difference of.398 

National Health and Medical Research Council 

of Australia.444 

National Institute for Clinical Excellence, U.K.322, 448 

National Institutes of Health, U.S.172,274 

National Library of Medicine (NLM), U.S.424, 426 

National Research Act (1974).20 

Natural log links.95 

Nested case-control studies.59-60 


Networking, and clinical research.284 

Neutral trials.187 

Non insulin-dependent diabetes (NIDDM).173, 357 

Non interventional/observational studies.52 

Non parametric method. See Allele-sharing method 

Non participants.56, 404 

Normally distributed errors.95 

Number needed to treat (NNT). 140,186-187, 

237, 392, 409 

Nuremberg Code.19,20 

O 


O’Brien and Fleming method.181,182 

Observational-experimental discrepancy.188 

Observational studies. See also Cohort studies; 
Longitudinal studies 

factors influencing sample size.62-63 

power.62-64 

sample size estimation.62-64 

Sample size for Log-Rank test.63-64 

type I and type II errors.63-64 

Odds ratio (OR).45, 63, 65-67, 79,100,101, 

134,209-211,219,257,296, 362, 363, 392,409 

Ontario health care system.434 

Ontario Health Technology Advisory Committee.436 

recommendations.436 

Open-label run-in periods, and clinical trials.185 

Opportunity costs.316-317, 320 


P 

Parametric method. See Recombinant-based method 


Participants 

confirming interpretive summaries.47 

in control trial.262 

delaying enrollment.163 

demographic data for identification.477 

to enroll eligible.170 

identified based on disease.14, 491 

monitoring.169 

quantitative scale.221 

research.25,26 

selection.7 

Patient Perception of Hemodialysis Scale (PPHS).309 

Patient-reported outcomes 

assessing change.204-205 

characteristics of scales.200-201 

Classical Test Theory.204 

clinically important difference.220 

comparing groups.204-205 

computerized adaptive testing.193,200, 201,204 

construct validity.202 

content validity.202 

criterion validity.202 

floor and ceiling effects.200, 201 
focus groups.199-200, 202 

health-related quality of life.192,195-199 

HRQOL claims.193,195 

individualized quality of life.199 

individual’s observation of experience.193 

instrument selection.195 

internal consistency reliability.203 

item response theory.193, 204 

measurement of subjective experience.199 

missing data.205 

model for assessment.204 

professional facilitators.199 

PROMIS.199,204 

quality of life domains.199 

scale reliability.200-201 

scales to discriminate individuals.200 

SF-36.197-199 

societal level.192 

supporting claims of therapeutic benefit.194 

test-retest reliability.203 

utility.193,194 

validity.202 

Patients, interventions, controls, and outcomes.4 

Peer-review, of research ethics.412, 459 

Personnel 

agreements with department/institution.119 

costs, salary and benefits.22 

training.25 

Pharmaceutical manufacturers 

economic evaluation of drugs, role in.324 

licensing of drugs, role in.118 

relationship with guidelines committees.9 

Physical Component Summary (PCS).198 

PICO. See Patients, interventions, controls, and outcomes 
PICO(S) model, for development of search 

strategy.118 

Planning 

in clinical trials.184 

of health services.317 

in prognostic biomarker study.211 

Poisson distribution.76, 80, 95,103,104 

Poisson model, for count 

check for model fail.110 

coefficient meaning in.110 

structure of 

likelihood function of 1.105 

MLEs of parameters.104 

model offset.103 

Poisson regression 

goodness-of-fit test for.102,105 

using Framingham data.105 

Policies 

ethical review.19-21 

governance.22 


Population 

and clinical trial.24 

and diagnostic test.36 

Positional cloning.352, 353, 358. See also Genetic 

epidemiology, in complex traits 

Prader-Willi syndrome.352 

Pragmatic trials.241, 428, 438 

Precision ratio.91 

Predictive values, for diagnostic tests.291 

Prentice criterion, to determine surrogate endpoint.223 

Privacy data.280 

Private industry, clinical research grant.274, 275 

Professional organizations, clinical research 

grant.275-276 

Programs for assessment of technology in health 

(PATH).433 

Project management 

biospecimen collection.284 

coordinator.281 

data collection.284-285 

ethics.282 

funding distribution.282 

hiring.281 

leadership.279-281 

memorandum of understanding.283 

milestones.282 

networking.284 

organization.279-281 

public relations.284 

reporting.285 

research plan implementation.283-284 

scientific advisory board.280 

team building.279 

training.282 

Propensity score matching 

advantages.141 

assessing the balance of covariates.142 

association between exposure and outcome.142 

constructing the score.141 

deriving a score.141 

limitations.141 

Proportional hazards model.64, 67,109,147-149,214 

Pseudo-R² for non linear models.102 

Public Health Service, U.S.20 

Public relations, importance of.284 

PULSES profile.192 

P value.9,13, 62 


Q 

Qualitative research 

construct validity.311 

example.304-312 

grounded theory.302-304 

item generation.309-310 

methods.301-302 

psychometric analysis.304, 310-312 

reliability.302, 310-312 

scale construction.304, 309-312 

substantive theory.303-307 

Quality 

clinical data.262, 280 

clinical evidence.385 

of CPG document.450 

Quality-adjusted life years (QALYs) 

cost analysis.325 

and CUA.430-431 

Quality control.12,170,172,280,285, 345, 361 

Quality of life analysis 

clinically important differences.268-271 

instruments.261-264 

missing data.261,263-267 

multivariate repeated measures model.271 

outcome prespecification.261,262,264-267, 

269,270 

quality adjusted survival.267-268 

scores.261-271 

SF-36.263 

treatment effect.264-265 

Quality of life, health-related. See Health-related quality of 
life (HRQOL) 

Quality of life instrument.310 

Quality of Reporting of Meta-Analyses (QUOROM) 

methods.399 

Questionnaires.25, 33, 42, 44,191,193,197-199, 

202,205,262,263,265,282-284, 319, 337, 344, 
388,431,450 

R 


Random effect models.111, 113, 114, 128 

Randomized controlled trial analysis 

alpha spending.181 

baseline characteristics.180,187-188 

baseline characteristics imbalance.187-188, 265 

censoring.183 

composite outcomes.178,184,189 

crossover trial.186 

factorial design.169, 185 

hypothesis-generating analyses.189 

intention-to-treat analysis.180 

interim analyses.169,170,180-183 

number needed to treat.186-187 

observational design analyses.62 

open-label run-in period.185 

power.187, 225-246 

research questions.159,160,273-274 

stopping rules.181-183, 201,245 

stratified design.185-186 

two-tailed hypothesis.178 


Randomized controlled trials design 

adaptive trials.163 

allocation concealment.162 

analysis frequency bias.169-170 

audit.162,170,172 

bias.171 

blinding.164 

block (see Blocking) 

cluster.162 

controlled trial.159-174 

costs.163,170,172 

economic issues.170 

end-point adjudication.173 

factorial design.163 

funding.171-172 

imbalanced group.180,187-188 

loss to followup.164,167,170-171 

multicenter.165,168 

neutral.179 

non inferiority trials.164 

one-tailed, trial design.167 

outcomes.166,170 

parallel-arm trial.15 

planning 

allocations and interventions.165 

inclusion and exclusion criteria.165 

randomization.159, 162-163 

recruitment role.167-168 

reporting.174 

research question.161 

risk factors.160 

sample size estimation.159,167 

stratified.185-186 

subgroup analyses.169 

subjects, characteristics of.179-180 

surrogate markers of.166 

treatment period.168-169 

trial design.162-166 

trial team.172-173 

Random sample.7, 8,220,225 

Random sequences generation.8 

Rank preserving structural failure time model 

(RPSFTM).251,252 

Rasch models.204 

Recall ratio.425 

Receiver operating characteristic (ROC) 

curve.210, 294,295,297 

Recombinant-based method.353, 354 

Recombination fraction.353, 354, 356. See also 

Genetic epidemiology, in complex traits 

Recruitment rate.167-168 

Reference Manager®.426 

RefWorks®.426 

Regression method of analysis 

estimation purpose.74 

intercept (β0).74 

likelihood and probability.76 

maximum likelihood estimation (MLE).75, 76 

ordinary least square method.75 

parameter estimates.74, 76 

Relative risks.45, 54, 63, 65, 66, 89,101, 

104,105,215, 392, 409,446 

Reliability coefficients.203, 204 

RENAAL clinical trial.173 

Repeated measures.5, 9, 94,110,128,129,169, 

241,271,345 

Reporting 

in clinical research.17 

for proper interpretation and evaluation.17 

Research design, for diagnostic test.14 

Research ethics 

application.29 

application tips.29 

Belmont report.20 

board.22-24,26,27,29,282 

clinical trials.24,26 

development.19-21 

governance of.21 

inclusiveness in research.28-29 

informed consent.24-28 

privacy and confidentiality of.21-22 

sample study.20 

study risk, benefit.23-24 

Research ethics board (REB) 

composition.22 

functions of.23 

of Memorial University.462 

risks and benefits, review of.23-24 

Research participants, ethical 

privacy rights.26 

responsibilities of.26 

withdrawal procedure.27 

Research plan, clinical implementation of.279-281 

Research question 

accuracy.5,14 

bias.5-7 

clinical relevance.3,13 

confounding.6 

construct validity.11 

controls.5-7,15 

diagnostic test.4,14,16,17 

effectiveness.5,12,13,15,16 

efficiency.15 

error.3, 5-9,11,13,15 

external validity.11-12 

framing.3-17 

hard end-points.10 

hierarchy of evidence.13-14 

internal validity.11-12 

interventions.14-15 


longitudinal studies.17 

measurement error.11 

measurements.9-11 

outcomes.10 

patients.4,16 

precision.5-7 

random error.6, 8 

randomization.5, 8,13,17 

randomized controlled trials.3, 5,13,14 

sample size estimation.8-9 

sampling.3, 7-8 

statistical significance.13 

study design.9,12-14 

surrogate markers.10 

systematic error.5-7 

Restriction, in confounding improvement.58 

Review Committees, ethical.21, 22 

Research Ethics Board (REB).21 

Risk factors.4,14, 34, 44, 51-53, 55-61, 

65, 67, 71, 89, 90, 99,146,160,188,208-209, 
333, 336, 337, 339, 343, 364, 388, 393, 395,431, 
458,477,478 
Risk prediction models 

brier score.148 

calibration.148 

C-statistic.148 

discrimination.147-148 

Hosmer-Lemeshow chi square statistic.148 

index.149 

integrated discrimination improvement.152 

knowledge translation.154 

model development.146-149 

net reclassification index.149, 153-154 

selection of variables.150 

validation cohort.151-152 

R² statistics.91, 96, 102 


S 

Safety, in clinical trials.160 

Safety monitoring.169,183. See also Data safety and 

monitoring committee 

clinical trial data.169,183 

Sample size 

binary outcomes.231, 240, 241 

biologically plausible effect.238 

calculation.231-235 

clinical trial designs.225-246 

composite end points.226,243 

confidence intervals.230,236, 238, 399 

continuous outcomes.231, 299 

diagnostic biomarkers.241 

drop-in.240, 241,250 

dropout.240,241 

estimation.8-9, 52, 62-64,159,164, 

167,216,250,297 
false negative error.225, 226 

false positive error.225, 226 

infeasible sample size options.241, 242 

longitudinal studies.226,244-245 

loss to follow up.226,239-241 

minimum clinically important effect.237, 238 

multiple primary end-points.226 

negative predictive value.226, 229 

non adherence.250 

non inferiority trials.238-239 

null hypothesis.63, 64,166, 227 

pilot studies.236-237 

positive predictive value.230 

power.225-246 

pragmatic trials.241 

primary end-point.226, 244 

probability theory.225 

prognostic biomarkers.215-216 

publication bias.230 

randomized controlled trials.225-246 

random variation.225 

standardized effect size conventions.238 

survival outcomes.231 

treatment effect.237,240, 241 

treatment effect size estimation.250 

two-group studies.226 

type I error.8, 62,166 

type II error.9,15,166,167 

underpowered studies.62 

variability of outcome variable.62,234, 244 

Scale 

characteristics.200-201 

construction, qualitative database in.309-312 

reliability to discriminate individual.200 

utility.430 

validity.311-312 

Schedule for the evaluation of individual quality of life 

(SEIQoL).199 

Search strategy, designing.425-426 

SF-6D, derivative measure.198 

Short Form-36 (SF-36) Health Survey.197,198 

Single-nucleotide markers (SNPs).355 

genotyping for association-based studies.357-363 

SNPs. See Single-nucleotide markers (SNPs) 

Sponsorship, for guideline development and 

implementation.448 

Staffing. See Personnel 

Standardization of training, methods, and protocol.171 

Standard link functions and inverses.95 

Statements, guideline rules for use of.445 

Statistical methods and results, reporting of 

fit component.78 

point estimates.411 


special checking, models requirement.91 

variability in response.77-83 

Statistical model 

appropriate model to fit data assumption verification and 

model specification check.79 

components of.78 

critical violations of.77 

exposure-response relationship.79-80 

fit portion of linear model.77 

information gain and residual variance.82 

meanings, model fitted to data.81-82 

multidimensional consequences and inputs.83 

multivariate analysis.82-83 

random component of model.80-81 

transformation of data.81 

Statistical significance.3,13, 46, 47, 62,188, 

213,214,219,221,254,269,407 

Steering committee, for clinical trial management.172 

Stopping rules. 169,181-183, 201,245 

Stratification, in confounding improvement.67 

Stratified design 

examples of.185 

structure.109 

Study power (1-beta error).9 

Subgroup analysis, of clinical trial.169 

Substantive theory generation, grounded 

theory in.304-308 

Surrogate outcomes.9,207-223, 323, 445 

Survival analysis 

forms of.108 

key requirements for.105-106 

Survival data, study of.106 

Survival time.37,105,106,108,116,128, 

147,252,253, 340 

Systematic review. See also Literature review 

advantages.410-411 

conduct.399-401 

in HTA.419,423 

limitations.398, 410-413 

scientific quality.401-410 


T 

Target population.4, 5, 7, 11, 34-40, 56, 

61,155,216,263, 310, 347, 388,411,444,450 

Technical efficiency.273, 317, 318, 320 

Time-to-event data, functions of.106-108 

Translational research 

bariatric surgery.460, 461, 463, 464 

Canadian Institutes for Health Research.457-459 

definition.456-457 

evidence-based decision making.455-466 

evidence-based practice.455, 458 

example.458, 463 

frameworks.457-458 
integrated knowledge translation activities.459 

integrated knowledge translation team.460-466 

knowledge-to-action cycle.459, 460 

knowledge translation.457-460 

levels.457 

phases.457, 458 

Treatment effect 

biologically plausible.252 

censoring.252 

clinically important difference.220,230,237, 270 

contamination.253 

contamination-adjusted intention-to-treat 

analysis.257 

crossover.249 

design.250 

EVOLVE trial.253,258 

example.253-258 

intention-to-treat analysis.249, 253 

interactive parameter estimation.251, 253 

inverse probability of censoring weights.252 

lag censoring.251 

non adherence.253-258 

rank preserving structural failure time 

model.251, 252 

sample size estimation.250 

treatment as received analysis.251 

Treatment period.168-169,186 

Treat-to-goal clinical study.51 

Tri-Council Policy Statement (TCPS).21, 23, 28 

risks and benefits, review of.23-24 


Trio design, in family association studies.360 

t-Test.64, 96, 97 

Tuskegee study.20 


Type I error. 8,11,13, 62,164,166,167,170,181 

Type II error.9,15, 43, 62,166,167,187,227 

U 

Utility.63,160,182,193,194,198,210,212, 

214,231,262,267,268,290,297,298, 317, 319, 
320, 326, 364, 374, 376-378, 386,430, 431,458 

V 

Validation of biomarkers.211-220 

Validity of study.11,17,27 

external and internal.11-12 

Variables 

confounding.58, 60,61, 65,66, 85,212, 338,343-346 

interacting.85 

types of.60,137 

Variance-corrected models.120,121. See also Extended 

generalized linear models 

Vulnerable populations, in ethical research.20 


W 

Willingness to pay method.267, 319, 430 

Will Rogers phenomenon.32, 40, 335, 341 

Z 

Z value.63, 64, 355 